Pymongo-text search-list of phrases - python

I have a dictionary with the following structure:
keywords = {topic_1: {category_1: ['\"phrase_1\"', '\"phrase_2\"'],
                      category_2: ['\"phrase_1\"', '\"phrase_2\"']},
            topic_2: {category_1: ['\"phrase_1\"', '\"phrase_2\"', '\"phrase_3\"']}}
I have a set of documents in MongoDB that I want to tag with a [category, topic] pair whenever a document matches any one of the phrases in keywords[topic][category]. At the moment I have to iterate phrase by phrase, as follows (PyMongo):
for topic in keywords:
    for category in keywords[topic]:
        for phrase in keywords[topic][category]:
            docs = db.collection.find({'$text': {'$search': phrase}}, {'_id': 1})
Instead, I would like to pass in the whole list of phrases and get back the documents that match any one phrase, for every [topic][category] list. Is this possible in PyMongo? Is OR-ing of phrases possible, and if so, how do I go about it? I tried concatenating the phrases into a single string, but that didn't work. My actual collection has a million documents and the dictionary will be very large as well, so iterating phrase by phrase hurts performance.
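One possible direction, sketched under assumptions that are not in the question: the documents keep their searchable content in a string field (called `text` here, a hypothetical name), and a case-insensitive `$regex` alternation is acceptable in place of `$text`. The helper `build_phrase_queries` is illustrative only. It builds one query per [topic][category] list, so the collection would be queried once per list rather than once per phrase. A regex scan like this will generally not benefit from a text index, so it would need measuring on a collection of this size.

```python
import re

def build_phrase_queries(keywords, field="text"):
    """Return {(topic, category): query}, one regex query per phrase list.

    `field` is a hypothetical field name; adjust it to the real schema.
    """
    queries = {}
    for topic, categories in keywords.items():
        for category, phrases in categories.items():
            # Drop the double quotes that were added for $text phrase syntax.
            cleaned = [phrase.strip('"') for phrase in phrases]
            if not cleaned:
                continue  # an empty pattern would match every document
            # A single alternation matches documents containing any phrase.
            pattern = "|".join(re.escape(phrase) for phrase in cleaned)
            queries[(topic, category)] = {
                field: {"$regex": pattern, "$options": "i"}
            }
    return queries

keywords = {
    "topic_1": {"category_1": ['"phrase one"', '"phrase two"']},
    "topic_2": {"category_1": ['"phrase three"']},
}
queries = build_phrase_queries(keywords)

# With a live connection, each query would then be run once, for example:
# for (topic, category), query in queries.items():
#     docs = db.collection.find(query, {"_id": 1})
```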

Related

How to add the processed text (token of words) column back to its original table using python

I would like to perform text analysis, such as a word cloud and n-grams, on one of the text columns. I have broken each sentence down into tokens and want to join them back to the original table.
For example, here are my two rows:
Code Text
ST-441 Purpose of your visit mentioned
St-432 Describe how and where it happened
After cleaning the text column, I applied the following function:
def cleans(data):
    # Split each sentence in the column into a list of words
    tokens = list(map(lambda text: text.split(' '), data))
    return tokens
After applying the split function, each sentence is broken down into words, so one row becomes n rows. I want to add these rows back to the original table using the unique identifier column.
Now I have a list of tokens like 'purpose', 'your', 'visit', 'mentioned', 'described', ...
I am looking for the output below:
Code Text
ST-441 Purpose
ST-441 your
ST-441 Visit
ST-441 mentioned
ST-432 Describe
ST-432 how
Any help would be much appreciated.
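A minimal sketch of the reshaping being asked for, assuming the table can be treated as a list of (Code, Text) pairs. The name `explode_tokens` is made up for illustration, and no stop-word filtering is applied, so words such as "of" are kept.

```python
rows = [
    ("ST-441", "Purpose of your visit mentioned"),
    ("ST-432", "Describe how and where it happened"),
]

def explode_tokens(rows):
    """Repeat each row's code once per token of its text."""
    return [(code, token) for code, text in rows for token in text.split()]

token_rows = explode_tokens(rows)
for code, token in token_rows:
    print(code, token)
```

If the table is a pandas DataFrame, the same reshaping can be written as `df.assign(Text=df["Text"].str.split()).explode("Text")`, which keeps the Code value on every exploded row.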

Map clinical Document names to LOINC names

I am using LOINC to match clinical document names. I can match document types directly to LOINC long names, but for some document names I am not able to find a matching document type, for example "operative report" in the image below. Is there an approach for matching them semantically? For example, "fever" and "temperature" are semantically related. Are there any other thesauri or clinical standards that I can map them to?
Example:

How to search for multiple multi-word phrases in pandas?

I have some JSON data converted into a Pandas DataFrame. I am looking to find all rows whose string content matches a list of multi-word phrases.
I am working with a massive amount of Twitter JSON data already downloaded for public use (so Twitter API usage is not applicable). This JSON is converted into a Pandas DataFrame. One of the available columns is text, which is the body of the tweet. An example is:
We’re kicking off the first portion of a citywide traffic calming project to make residential streets more safe & pedestrian-friendly, next week!
Tuesday, July 30 at 10:30 AM
Nautilus Drive and 42 Street
I want to be able to have a list of phrases, phrases = ["We're kicking off", "we're starting", "we're initiating"], and do something like df[df['text'].str.contains(phrases)] so that I obtain the DataFrame rows whose text column contains one of the phrases.
This is perhaps asking too much, but ideally I would also be able to match something like phrases = ["(We're| we are) kicking off", "(we're | we are) starting", "(we're| we are) initiating"]
Make a list of the keywords or phrases you want to match. I have used logic for an exact match; you can change it by changing the regex. The code also records which keyword each text was matched by.
Here is the code:
import re
import pandas as pd

commentlist, keywordlist = [], []
for keyword in mustkeywords:
    for comment in text:
        # Match the keyword at a word boundary, followed by a non-word character
        result = re.search(r'\s*\b' + keyword + r'\W\s*', comment)
        if result:
            commentlist.append(comment)
            keywordlist.append(keyword)
# DataFrame of the matched texts and the keyword that matched each one
tempmustkeywordsdf = pd.DataFrame({"Comments": commentlist, "Keywords": keywordlist})
Here mustkeywords is a list containing your phrases or keywords,
text is a list of the strings that you want to search for those keywords,
and tempmustkeywordsdf is a DataFrame containing the matched strings and the keywords that matched them.
I hope this helps.
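As an alternative sketch, the phrases can be combined into a single alternation so that each text is scanned once. This assumes the phrases are already valid regular expressions and uses made-up sample texts with straight apostrophes; the typographic apostrophe in the sample tweet (We’re) would need to be normalized or added to the pattern.

```python
import re

# Non-capturing groups (?:...) keep the alternatives together.
phrases = [
    r"(?:we're|we are) kicking off",
    r"(?:we're|we are) starting",
    r"(?:we're|we are) initiating",
]
# Wrap each phrase in its own group, then OR the phrases together.
combined = "|".join("(?:{})".format(phrase) for phrase in phrases)
pattern = re.compile(combined, re.IGNORECASE)

texts = [
    "We're kicking off the first portion of a citywide traffic calming project",
    "Nothing relevant here",
    "Next week we are starting road works",
]
matches = [t for t in texts if pattern.search(t)]
print(matches)
```

With a DataFrame, the same combined string could be passed as `df[df["text"].str.contains(combined, case=False, regex=True, na=False)]`.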

Doc2vec on a corpus of novels: how do I assign to each sentence of a novel one tag for the ID of the sentence and one tag for the ID of the book?

I am trying to train a doc2vec model on a corpus of six novels and I need to build the corpus of Tagged Documents.
Each novel is a txt file, already preprocessed and read into Python using the read() method, so that it appears as a "long string". If I try to tag each novel using TaggedDocument from gensim, each novel gets only one tag, and the corpus of tagged documents has only six elements (which is not enough to train the doc2vec model).
It has been suggested that I split each novel into sentences, then assign each sentence one tag for the ID of the sentence and one tag for the ID of the book it belongs to. However, I am stuck because I do not know how to structure the code.
This was the first code, i.e. the one using each novel in the format of a "long string":
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [emma_text, persuasion_text, prideandprejudice_text,
             janeeyre_text, shirley_text, professor_text]
corpus = []
for docid, document in enumerate(documents):
    # One tag per novel, e.g. "0000"
    corpus.append(TaggedDocument(document.split(),
                                 tags=["{0:0>4}".format(docid)]))
d2v_model = Doc2Vec(vector_size=100,
                    window=15,
                    hs=0,
                    sample=0.000001,
                    min_count=100,
                    workers=-1,
                    epochs=500,
                    dm=0,
                    dbow_words=1)
d2v_model.build_vocab(corpus)
d2v_model.train(corpus, total_examples=d2v_model.corpus_count,
                epochs=d2v_model.epochs)
This, however, means that my corpus of tagged documents has only six elements and that my model does not have enough elements on which to train. If, for instance, I try to apply the .most_similar method to a target book, I get completely wrong results.
To sum up, I need help to assign each sentence of each book (I have already split the books into sentences) one tag for the ID of the sentence and one tag for the ID of the book it belongs to, using TaggedDocument to build the corpus on which I will train my model.
Thanks for the attention!
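A minimal sketch of the tagging scheme described above, assuming each book is already available as a list of sentence strings. The function `tag_sentences`, the book IDs, and the toy sentences are made up for illustration. So that the snippet runs without gensim installed, it uses a namedtuple stand-in with the same `words` and `tags` fields as gensim's TaggedDocument; in real code the gensim class would be imported instead.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim.models.doc2vec.TaggedDocument.
# In real code, replace this line with:
#   from gensim.models.doc2vec import TaggedDocument
TaggedDocument = namedtuple("TaggedDocument", "words tags")

def tag_sentences(books):
    """books maps a book ID to that book's list of sentences."""
    corpus = []
    for book_id, sentences in books.items():
        for sent_no, sentence in enumerate(sentences):
            # Unique sentence tag, e.g. "emma_0000", plus the shared book tag.
            sentence_tag = "{}_{:0>4}".format(book_id, sent_no)
            corpus.append(
                TaggedDocument(sentence.split(), [sentence_tag, book_id])
            )
    return corpus

books = {  # hypothetical toy data
    "emma": ["Emma Woodhouse was handsome", "She was the youngest"],
    "persuasion": ["Sir Walter Elliot was vain"],
}
corpus = tag_sentences(books)
print(corpus[0])
```

Because every sentence of a book carries the same book tag, the model would learn one vector per sentence tag and one per book tag.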

Generate WordCloud from multiple sets of text

Based on this question, How to create a word cloud from a corpus in Python?, I built a word cloud using amueller's library. However, I fail to see how I can feed the cloud with more than one set of text. Here is what I have tried so far:
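from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
stopwords.add("said")  # set.add returns None, so it cannot be passed inline
wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=stopwords)
wc.generate(set_of_words)
wc.generate("foo")  # this overwrites the previous line of code
                    # but I would like this to be appended to the set of words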
I cannot find any manual for the library, so I have no idea how to proceed. Do you? :)
In reality, as you see here: Dictionary with array of different types as value in Python, I have this data structure:
category = { "World news": [2, "foo bla content of", "content of 2nd article"],
"Politics": [1, "only 1 article here"],
...
}
and I would like to append "foo bla content of" and "content of 2nd article" to the word cloud.
The easiest solution would be to regenerate the wordcloud with the updated corpus.
To build a corpus with the text contained in your category data structure (for all topics) you could use this comprehension:
# Update the corpus
corpus = " ".join([" ".join(value[1:]) for value in category.values()])
# Regenerate the word cloud
wc.generate(corpus)
To build the word cloud for a single key in your data structure (e.g. Politics):
# Update the corpus
corpus = " ".join(category["Politics"][1:])
# Regenerate the word cloud
wc.generate(corpus)
Explanation:
join glues multiple strings together, separated by a given delimiter
[1:] takes all the elements from a list except the first one
dict.values() gives all the values in the dictionary
The expression " ".join([" ".join(value[1:]) for value in category.values()]) thus can be translated as:
First glue together all the elements per key except the first one (as it is a counter). Then glue together all the resulting strings.
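The two joins can be checked on their own, without the wordcloud library. This small sketch uses the two entries shown in the question (the elided "..." entries are left out):

```python
category = {
    "World news": [2, "foo bla content of", "content of 2nd article"],
    "Politics": [1, "only 1 article here"],
}

# Inner join: one string per topic, skipping the leading counter.
per_topic = [" ".join(value[1:]) for value in category.values()]
# Outer join: one string for the whole corpus.
corpus = " ".join(per_topic)

print(per_topic)
print(corpus)
# prints: foo bla content of content of 2nd article only 1 article here
```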
From a brief skim over the class in https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py there isn't an update method, so you would need either to regenerate the wordcloud or add an update method.
Easiest way would probably be to maintain the original source text, and add to the end of this, then regenerate.
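A rough sketch of that idea, using a hypothetical helper class `CorpusBuffer` (not part of the wordcloud library) that remembers every text added so far and returns the full corpus to regenerate from:

```python
class CorpusBuffer:
    """Hypothetical helper: keeps all texts so the cloud can be rebuilt."""

    def __init__(self):
        self.texts = []

    def add(self, text):
        """Store the new text and return the whole corpus as one string."""
        self.texts.append(text)
        return " ".join(self.texts)

buffer = CorpusBuffer()
buffer.add("foo bla content of")
corpus = buffer.add("content of 2nd article")

# With a WordCloud instance wc, the cloud would then be rebuilt from scratch:
# wc.generate(corpus)
```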
