I am using LOINC to match clinical document names. I can match document types directly to LOINC long names, but for some document names I cannot find a matching document type, for example "operative report" in the image below. Is there an approach for matching them semantically? For example, "fever" and "temperature" are semantically related words. Are there any other thesauri or clinical standards that I can map them to?
Example:
I am working on a task to extract architects and their buildings from unstructured pieces of text of varying sizes. I started with an NLP tool called spaCy, but the annotations it provides are sometimes mixed up.
FAC Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
Building names fall into all four of those annotations. My job would be much easier if I could get only FAC for building names, but that does not seem to be possible, or I could not make it work.
The question is: is it even possible to use NLP tools to extract such information tuples (in my case {Architect, Building}) from a chunk of text?
Edit: Some things I have done
The following are some examples of the texts I am using at the moment:
He renovated Fatih Mosque and built Laleli Mosque in the name of
Sultan Mustafa III
Mehmed Tahir Ağa built Hamidiyye Complex in Bahçekapı for Sultan
Abdülhamid I.
I am passing those texts to spaCy as data; the code is here:
for i in range(len(data)):
    text = data[i]
    text = re.sub(r'\([^()]*\)', '', text)
    doc = nlp(text)
    #Extract ORG, GPE, LOC and FAC labels from phrases
    for entity in doc.ents:
        if entity.label_ in ('ORG', 'GPE', 'LOC', 'FAC'):
            #Manual filtering of results
            if entity.text not in ("Istanbul", "Egypt", "Hicaz", "Palestine", "Syria", "Balkans", "Albania", "Malta", "Spain", "Bosnia", "Frengistan", "Kırım", "Belgrade", "Damascus"):
                print(entity.text, entity.label_)
Output is:
Laleli Mosque ORG
Hamidiyye Complex ORG
Bahçekapı for Sultan Abdülhamid I. ORG
It depends on how closely the text you're working with follows the structure of the data that spaCy's default models were trained on. If they're very different, you might have to train your own model instead of using theirs. The team behind spaCy (explosion.ai) has a paid tool that can help you do this (prodi.gy). That said, it probably is possible to do what you want, but putting together a training set without tool support is not easy.
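Before training a custom model, it may be worth checking how far a plain rule-based baseline gets on these sentences. The following is only a minimal sketch, not spaCy's approach: it assumes that the trigger verbs are "built" and "renovated" and that names are runs of capitalized words, and the names `VERBS`, `capitalized_run` and `extract_pairs` are made up for illustration.

```python
import re

# Assumed trigger verbs; extend for your own corpus.
VERBS = {"built", "renovated"}

def capitalized_run(tokens):
    """Collect leading tokens that start with an uppercase letter."""
    run = []
    for tok in tokens:
        if tok[:1].isupper():
            run.append(tok)
        else:
            break
    return run

def extract_pairs(text):
    """Return (architect, building) guesses around each trigger verb."""
    tokens = re.findall(r"\w+", text)  # \w is Unicode-aware in Python 3
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in VERBS:
            # Capitalized words right before the verb: candidate architect.
            before = capitalized_run(reversed(tokens[:i]))[::-1]
            # Capitalized words right after the verb: candidate building.
            after = capitalized_run(tokens[i + 1:])
            if after:
                pairs.append((" ".join(before) or None, " ".join(after)))
    return pairs

print(extract_pairs("Mehmed Tahir Ağa built Hamidiyye Complex in Bahçekapı for Sultan Abdülhamid I."))
print(extract_pairs("He renovated Fatih Mosque and built Laleli Mosque in the name of Sultan Mustafa III"))
```

On the second sentence the architect comes back as "He" or is missing entirely, which shows where such rules stop working: resolving pronouns and shared subjects needs a parser or a trained model.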
I have some JSON data converted into a Pandas DataFrame. I am looking to find all columns whose string content matches a list of multi word phrases.
I am working with a massive amount of Twitter JSON data that has already been downloaded for public use (so Twitter API usage is not applicable). This JSON is converted into a Pandas DataFrame. One of the available columns is text, which is the body of the tweet. An example is
We’re kicking off the first portion of a citywide traffic calming project to make residential streets more safe & pedestrian-friendly, next week!
Tuesday, July 30 at 10:30 AM
Nautilus Drive and 42 Street
I want to be able to have a list of phrases, phrases = ["We're kicking off", "we're starting", "we're initiating"], and do something like df[df['text'].str.contains(phrases)] to obtain the DataFrame rows whose text column contains one of the phrases.
This is perhaps asking too much, but ideally I would also be able to match something like phrases = ["(We're| we are) kicking off", "(we're | we are) starting", "(we're| we are) initiating"]
Make a list of the keywords or phrases you want to match. I have written the logic for an exact match; you can change that by changing the regex. It also records which keyword caught each text.
Here is the code:
import re
import pandas as pd

commentlist, keywordlist = [], []
for keyword in mustkeywords:
    for comment in text:
        # whole-phrase match: word boundary before, a non-word character after
        if re.search(r'\s*\b' + keyword + r'\W\s*', comment):
            commentlist.append(comment)
            keywordlist.append(keyword)
# temp df pairing each matched comment with the keyword that caught it
tempmustkeywordsdf = pd.DataFrame({"Comments": commentlist, "Keywords": keywordlist})
Here mustkeywords is a list that contains your phrases or keywords,
text is a list of strings (for example the text column) that you want to check for keywords,
and tempmustkeywordsdf is a DataFrame that contains the matched strings and the keywords that matched them.
I hope this helps.
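For the "(we're | we are)" variant in the question, one option is to write each phrase as a regex fragment and join them all into a single alternation. This is a minimal sketch using only the standard library. The sample tweets are shortened or invented, and the character class `['’]` is there because the example tweet uses a curly apostrophe while the phrase list uses a straight one.

```python
import re

# Each entry is a regex fragment; ['’] accepts straight and curly apostrophes.
phrases = [
    r"(?:we['’]re|we are) kicking off",
    r"(?:we['’]re|we are) starting",
    r"(?:we['’]re|we are) initiating",
]

# OR all fragments together; non-capturing groups keep each fragment intact.
pattern = "|".join(f"(?:{p})" for p in phrases)
matcher = re.compile(pattern, flags=re.IGNORECASE)

# Illustrative sample data, not real tweets.
tweets = [
    "We’re kicking off the first portion of a citywide traffic calming project",
    "Road closed on Nautilus Drive next week",
    "Next month we are starting phase two",
]

matched = [t for t in tweets if matcher.search(t)]
print(matched)
```

The same `pattern` string should also work on the DataFrame directly, along the lines of `df[df["text"].str.contains(pattern, case=False, regex=True)]`. Using non-capturing groups avoids the warning pandas emits when the pattern contains capture groups.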
I am trying to train a doc2vec model on a corpus of six novels and I need to build the corpus of Tagged Documents.
Each novel is a txt file, already preprocessed and read into Python using the read() method, so that it appears as a "long string". If I try to tag each novel using TaggedDocument from gensim, each novel gets only one tag, and the corpus of tagged documents has only six elements (which is not enough to train the doc2vec model).
It has been suggested that I split each novel into sentences, then assign each sentence one tag for the ID of the sentence and one tag for the ID of the book it belongs to. However, I am having trouble because I do not know how to structure the code.
This was the first code, i.e. the one using each novel in the format of a "long string":
`documents=[emma_text, persuasion_text, prideandprejudice_text,
           janeeyre_text, shirley_text, professor_text]
corpus=[]`
`for docid, document in enumerate(documents):
    corpus.append(TaggedDocument(document.split(),
                                 tags=["{0:0>4}".format(docid)]))`
`d2v_model = Doc2Vec(vector_size=100,
                    window=15,
                    hs=0,
                    sample=0.000001,
                    min_count=100,
                    workers=-1,
                    epochs=500,
                    dm=0,
                    dbow_words=1)
d2v_model.build_vocab(corpus)`
`d2v_model.train(corpus, total_examples=d2v_model.corpus_count,
                epochs=d2v_model.epochs)`
This, however, means that my corpus of tagged documents has only six elements and that my model does not have enough elements on which to train. If, for instance, I try to apply the .most_similar method to a target book, I get completely wrong results.
To sum up, I need help to assign each sentence of each book (I have already split the books into sentences) one tag for the ID of the sentence and one tag for the ID of the book it belongs to, using TaggedDocument to build the corpus on which I will train my model.
Thanks for the attention!
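The tagging scheme described above (one tag per sentence plus one tag per book) could be sketched roughly as follows. This is only an illustration: it assumes the sentences are already available as a dict of lists, the helper name `build_corpus` and the tag format are made up, and the namedtuple fallback is just a stand-in so the snippet runs without gensim installed (gensim's TaggedDocument is itself a namedtuple with the fields words and tags).

```python
try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, only so the sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

def build_corpus(books):
    """books: dict mapping a book ID to its list of sentence strings."""
    corpus = []
    for book_id, sentences in books.items():
        for sent_no, sentence in enumerate(sentences):
            # First tag is unique to the sentence, second is shared by the book.
            tags = [f"{book_id}_{sent_no:05d}", book_id]
            corpus.append(TaggedDocument(words=sentence.lower().split(), tags=tags))
    return corpus

# Tiny made-up input; in practice each list holds all sentences of a novel.
books = {
    "emma": ["Emma Woodhouse was handsome.", "She was clever."],
    "persuasion": ["Sir Walter Elliot was vain."],
}
corpus = build_corpus(books)
print(len(corpus), corpus[0].tags)
```

Because every sentence of a book carries the book's tag, the trained model should end up with one vector per book as well as one per sentence, and the book vectors can then be looked up by the book ID.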
I have some text from an XML document, and I am trying to extract the text within tags that contain certain words.
For example below:
search('adverse')
should return the text of all the tags containing the word 'adverse'
Out:
[
"<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
]
and search('clinical')
should return two results since two tags contain those words.
Out:
[
"<title>6.1 Clinical Trials Experience</title>",
"<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"
]
What tools should I use for this? RegEx? BS4? Any suggestions are greatly appreciated.
Example Text:
</highlight>
</excerpt>
<component>
<section id="ID40">
<id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
<title>6.1 Clinical Trials Experience</title>
<text>
<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
<list id="ID42" listtype="unordered" stylecode="Disc">
<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>
You could either hardcode it with a regex, or parse your xml file with a library like lxml
With a regex that would be:
import re

your_text = "(...)"

def search(instr):
    return re.findall(r"<.+>.*{}.*<.+>".format(instr), your_text, re.MULTILINE)

print(search("safety"))
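For the parsing route, here is a rough sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra installs. It assumes the input is well-formed XML (the excerpt in the question is cut off, so the sample below is a shortened, made-up version of it) and it only looks at each element's own leading text, not at text nested in child elements.

```python
import copy
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the document in the question.
sample = """<section id="ID40">
  <title>6.1 Clinical Trials Experience</title>
  <text>
    <paragraph id="ID41">The clinical efficacy and safety were evaluated.</paragraph>
    <list id="ID42">
      <item>The most common adverse reactions were dizziness.</item>
    </list>
  </text>
</section>"""

def search(root, word):
    """Return the markup of every element whose own text contains word."""
    word = word.lower()
    hits = []
    for elem in root.iter():
        if word in (elem.text or "").lower():
            # Serialize a copy without the tail so trailing whitespace is dropped.
            clean = copy.copy(elem)
            clean.tail = None
            hits.append(ET.tostring(clean, encoding="unicode"))
    return hits

root = ET.fromstring(sample)
print(search(root, "adverse"))
print(search(root, "clinical"))
```

Unlike the regex version, this match is case-insensitive, so "clinical" also finds the title "6.1 Clinical Trials Experience".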
I have a dictionary of the following structure:
keywords={topic_1:{category_1:['\"phrase_1\"','\"phrase_2\"'],
                   category_2:['\"phrase_1\"','\"phrase_2\"']},
          topic_2:{category_1:['\"phrase_1\"','\"phrase_2\"','\"phrase_3\"']}}
I have a bunch of documents in MongoDB that I want to tag with a [category, topic] tag whenever a document matches any one of the phrases in [topic][category]. However, I currently need to iterate phrase by phrase as follows (PyMongo):
for topic in keywords:
    for category in keywords[topic]:
        for phrase in keywords[topic][category]:
            docs=db.collection.find({'$text':{'$search':phrase}},{'_id':1})
Instead of this, I just want to scan the list of phrases and get a list of documents that match any one phrase, for every [topic][category] list. Is this possible in PyMongo? Is OR-ing of phrases possible? If so, how do I go about it? I tried concatenating the phrases into a single string, but that didn't work. My actual Mongo collection has a million documents and the dictionary would be very large as well, so performance suffers if I iterate phrase by phrase.
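One possible way to OR the phrases, sketched here under assumptions, is to skip `$text` and build a single `$regex` alternation per [topic][category] list, so that each list needs one query instead of one query per phrase. The field name "text", the helper name `build_filters` and the sample phrases are all hypothetical. Note that an unanchored regex cannot make efficient use of an index, so whether this is actually faster on a million documents would need to be measured.

```python
import re

def build_filters(keywords, field="text"):
    """Return {(topic, category): filter dict} with one $regex per phrase list.

    `field` is an assumed name for the document field that holds the text.
    """
    filters = {}
    for topic, categories in keywords.items():
        for category, phrases in categories.items():
            # Drop the literal double quotes stored around each phrase,
            # escape regex metacharacters, and OR the phrases together.
            alternation = "|".join(re.escape(p.strip('"')) for p in phrases)
            filters[(topic, category)] = {
                field: {"$regex": alternation, "$options": "i"}
            }
    return filters

# Made-up placeholder phrases in the same shape as the question's dictionary.
keywords = {
    "topic_1": {"category_1": ['"phrase one"', '"phrase two"']},
    "topic_2": {"category_1": ['"phrase three"']},
}
filters = build_filters(keywords)
print(filters[("topic_1", "category_1")])
```

Each filter could then be passed to something like `db.collection.find(flt, {'_id': 1})`, giving the matching document IDs for that [topic][category] pair in one round trip.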