How to split SpaCy dependency tree into subclauses?


I am trying to split units of text by their dependency trees (according to SpaCy). I have experimented with much of the docs provided by spacy, but I cannot figure out how to accomplish this task. To visualize, see below:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")  # load a pipeline before calling nlp()
doc = nlp('I was, I dont remember. Do you want to go home?')
dependency_flow = displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})
The code above results in this dependency tree graph (which is split into 2 screenshots due to size):
Intuitively, this indicates that there are 2 independent clauses in the original sentence. The original sentence was 'I was, I dont remember. Do you want to go home?', and it is effectively split into two clauses, 'I was, I dont remember.', and 'Do you want to go home?'.
Output
How, using SpaCy or any other tool, can I split the original utterance into those two clauses, so that the output is:
['I was, I dont remember.', 'Do you want to go home?']?
My current approach is rather lengthy and expensive. It involves finding the two biggest subtrees in the original text whose relative indices span the range of the original text indices, but I'm sure there is another, better way.

Given your input and output, a clause does not span multiple sentences. So instead of going down the dependency-tree rabbit hole, it is simpler to take the sentences from the doc (internally they are Span objects) and treat them as your clauses.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('I was, I dont remember. Do you want to go home?')
print([sent.text for sent in doc.sents])
Output
['I was, I dont remember.', 'Do you want to go home?']

Related

A way to separate one word into two separate words using NLTK?

If we have an unseparated word, let's say
doctorsofamerica
Is there an NLTK import that I can use to separate this into
doctors of america
Thanks!

If anything other than NLTK is an option, I used to work with Word Segmentation which gave pretty good results for simple cases. Regarding your use case, it would look like this:
from wordsegment import load, segment
load()
separated = segment('doctorsofamerica')
print(' '.join(separated))
Output:
doctors of america
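For readers curious about the general idea behind frequency-based segmentation, here is a minimal sketch. It is not how wordsegment is implemented in detail; the vocabulary and its counts are made up for illustration, and a real segmenter would use counts from a large corpus.

```python
from functools import lru_cache

# Hypothetical toy vocabulary with made-up counts (assumption, not real corpus data).
VOCAB = {"doctors": 5, "doctor": 8, "of": 100, "america": 6,
         "a": 90, "so": 40, "me": 30, "am": 20}
TOTAL = sum(VOCAB.values())

def word_score(word):
    # Known words get their relative frequency; unknown strings get a
    # tiny score that shrinks further with length.
    if word in VOCAB:
        return VOCAB[word] / TOTAL
    return 1e-10 / 10 ** len(word)

def segment(text):
    @lru_cache(maxsize=None)
    def best(rest):
        # Returns (score, words) for the highest-scoring split of `rest`.
        if not rest:
            return 1.0, ()
        candidates = []
        for i in range(1, len(rest) + 1):
            head, tail = rest[:i], rest[i:]
            tail_score, tail_words = best(tail)
            candidates.append((word_score(head) * tail_score, (head,) + tail_words))
        return max(candidates)
    return list(best(text)[1])

print(" ".join(segment("doctorsofamerica")))  # doctors of america
```

The split with the highest product of word scores wins, so any split that contains an unknown fragment loses to one made entirely of known words.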

Emotional score of sentences using Spacy

I have a series of 100,000+ sentences and I want to rank them by how emotional they are.
I am quite new to the NLP world, but this is how I managed to get started (adaptation from spacy 101)
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")  # assumed model; nlp was undefined in the original snippet
matcher = Matcher(nlp.vocab)
def set_sentiment(matcher, doc, i, matches):
    # on_match callback: bump the document score once per match
    doc.sentiment += 0.1
myemotionalwordlist = ['you', 'superb', 'great', 'free']
sentence0 = 'You are a superb great free person'
sentence1 = 'You are a great person'
sentence2 = 'Rocks are made of minerals'
sentences = [sentence0, sentence1, sentence2]
pattern2 = [[{"ORTH": emotionalword, "OP": "+"}] for emotionalword in myemotionalwordlist]
# spaCy v2 signature; in v3 this is matcher.add("Emotional", pattern2, on_match=set_sentiment)
matcher.add("Emotional", set_sentiment, *pattern2)  # Match one or more emotional words
for sentence in sentences:
    doc = nlp(sentence)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
    print("Sentiment", doc.sentiment)
myemotionalwordlist is a list of about 200 words that I've built manually.
My questions are:
(1-a) Counting the number of emotional words does not seem like the best approach. Does anyone have suggestions for a better way of doing this?
(1-b) In case this approach is good enough, any suggestions on how I can extract emotional words from wordnet?
(2) What's the best way of scaling this up? I am thinking about adding all sentences to a pandas data frame and then applying the match function to each one of them.
Thanks in advance!

There are going to be two main approaches:
1. the one you have started, which is a list of emotional words, and counting how often they appear
2. showing a machine learning model examples of what you consider emotional and unemotional sentences, and letting it work it out.
The first way will get better as you give it more words, but you will eventually hit a limit. (Simply due to the ambiguity and flexibility of human language, e.g. while "you" is more emotive than "it", there are going to be a lot of unemotional sentences that use "you".)
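As a rough sketch of the first approach without spaCy: score each sentence by the share of its tokens found in the word list, then sort. The mini-lexicon is taken from the question; normalising by sentence length is an assumption of this sketch, not something the question asked for.

```python
import re

# Mini-lexicon from the question; the real list has about 200 words.
EMOTIONAL_WORDS = {"you", "superb", "great", "free"}

def emotional_score(sentence):
    # Fraction of tokens that appear in the lexicon (case-insensitive).
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in EMOTIONAL_WORDS)
    return hits / len(tokens)

sentences = [
    "You are a superb great free person",
    "You are a great person",
    "Rocks are made of minerals",
]
# Most emotional first.
ranked = sorted(sentences, key=emotional_score, reverse=True)
for sentence in ranked:
    print(round(emotional_score(sentence), 2), sentence)
```

Set membership checks are cheap, so this kind of scoring should handle 100K sentences in a plain loop before any pandas or multiprocessing is needed.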
any suggestions on how I can extract emotional words from wordnet?
Take a look at SentiWordNet, which adds a measure of positivity, negativity or neutrality to each WordNet entry. For "emotional" you could extract just those that have either a pos or neg score over e.g. 0.5. (Watch out for the non-commercial-only licence.)
The second approach will probably work better if you can feed it enough training data, but "enough" can sometimes be too much. Other downsides are that the models often need much more compute power and memory (a serious issue if you need to be offline, or working on a mobile device), and that they are a black box.
I think the 2020 approach would be to start with a pre-trained BERT model (the bigger the better, see the recent GPT-3 paper), and then fine-tune it with a sample of your 100K sentences that you've manually annotated. Evaluate it on another sample, and annotate more training data for the ones it got wrong. Keep doing this until you get the desired level of accuracy.
(Spacy has support for both approaches, by the way. What I called fine-tuning above is also called transfer learning. See https://spacy.io/usage/training#transfer-learning. Also, googling for "spacy sentiment analysis" will find quite a few tutorials.)

Extract Text From Unstructured Medical Documents For NLP

I have a lot of unstructured medical documents in all sorts of different formats.
What's the best way to parse out all the good sentences to use for NLP?
Currently I'm using SpaCy to do this, but even with multiprocessing it is pretty slow, and the default sentence parser doesn't work 100% of the time. Here is an example of how I try to get good sentences with SpaCy:
import spacy
def get_good_sents(texts, batch_size, n_process):
    nlp = spacy.load("en_core_web_sm", disable=[
        'ner',
        'entity_linker',
        'textcat',
        'entity_ruler',
        'sentencizer',
        'merge_noun_chunks',
        'merge_entities',
        'merge_subtokens',
    ])
    pipe = nlp.pipe(texts, batch_size=batch_size, n_process=n_process)
    rows = []
    for doc in pipe:
        clean_text = []
        for sent in doc.sents:
            # Keep a sentence only if it has a noun/pronoun and a verb-like token
            struct = [token.pos_ for token in sent]
            subject = any(x in struct for x in ['NOUN', 'PRON'])
            action = any(x in struct for x in ['VERB', 'ADJ', 'AUX'])
            if subject and action:
                clean_text.append(sent.text)
        rows.append(' '.join(clean_text).replace('\n', ' ').replace('\r', ''))
    return rows
Example of some text extracts
Raw Text:
TITLE
Patient Name:
Has a heart Condition.
Is 70 Years old.
Expected Output:
Has a heart Condition.
Is 70 Years old.
This example isn't great, because I have tons of different documents in all sorts of formats, and they can vary a lot. It basically boils down to wanting to strip out the boilerplate and keep just the actual free text.

Based on the comments from the above discussion, I am very confident that spaCy will not provide you with very good results, simply because it is very much tied to the expectation of a valid grammatical sentence.
At least with the current approach of looking for "correctly tagged words" in each line, I would expect this to not work very well, since tagging a sentence correctly is already tied to a decent input format;
it is once again time to quote one of my favorite concepts in Machine Learning.
Depending on the accuracy you want to achieve, I would personally adopt a defensive Regex approach, where you manually sort out headings (lines with fewer than 4 words, lines that end in a colon/semicolon, etc.), although it will require significantly more effort.
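A minimal sketch of such a defensive filter, using only the two heuristics named above (lines with fewer than 4 words, lines ending in a colon or semicolon). The function names and the `min_words` parameter are made up for illustration, and real documents will need more rules than this.

```python
def is_boilerplate(line, min_words=4):
    # Heuristics: empty lines, lines ending in ':' or ';', and very short lines.
    stripped = line.strip()
    if not stripped:
        return True
    if stripped.endswith((":", ";")):
        return True
    return len(stripped.split()) < min_words

def keep_free_text(raw_text):
    # Keep only the lines that survive the heuristics, in their original order.
    return [line.strip() for line in raw_text.splitlines() if not is_boilerplate(line)]

raw = """TITLE
Patient Name:
Has a heart Condition.
Is 70 Years old."""
print(keep_free_text(raw))  # ['Has a heart Condition.', 'Is 70 Years old.']
```

On the example from the question this keeps the two free-text lines and drops the title and the field label, but a threshold of 4 words will also drop genuinely short sentences, so it is worth tuning on a sample of your own documents.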
Another, more direct solution would be to take what other common boilerplate tools are doing, although most of those are targeted to remove boilerplate from HTML content, and thus have an easier time by utilizing tag information as well.

MedSpaCy's section detection can be used for this:
https://github.com/medspacy/medspacy/tree/master/notebooks/section_detection
They have some great example notebooks to structure clinical/medical text documents.

How to count [any name from a list of names] + [specific last name] in a block of text?

First-time poster here. I’m hoping I can find a little help on something I’m trying to accomplish in terms of text analysis.
First, I’m doing this in python and would like to remain in python as this function would be part of a larger, otherwise healthy tool I’m happy with. I have NLTK and Anaconda all set up as well, so drawing on those resources is also possible.
I’ve been working on a tool that tracks and adds up references to city names in large blocks of text. For instance, the tool can count how many times “Chicago,” “New York” or “Los Angeles,” “San Francisco” etc… are detected in a text chunk and can rank them.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say Jackson Mississippi, but not count “Frank Jackson” “Jane Jackson” etc…
What I would like to do however is figure out a way to account for any false positive that might be [any name from a long list of first names] + [Select last name].
I have assembled a list of ~5000 first names from the census data that I can also bring into python as a list. I can also check true/false to find if a name is on that list, so I know I’m getting closer.
However, what I can’t figure out is how to express what I want, which is something like (I’ll use Jackson as an example again):
totalfirstnamejacksoncount = count (“[any name from census list] + Jackson”)
More or less. Is there some way I can phrase it as a wildcard over the census list? Can I set a variable that would read as “any item in this list” so I could write “anynamevariable + Jackson”? Or is there any other way to denote something like “any word in census list + Jackson”?
Ideally, my aim is to get a total count of “[Any first name] + [Specified last name]” so I can (a) subtract it from the total count of [Last name that is also a city name] and (b) maybe use that count for some other refinements.
In a worst-case scenario I could directly modify the census list, appending Jackson (or whatever last name I need) to each name and adding those lines manually, but I feel that would make a complete mess of my code given ~5000 first names for each last name I’d like to handle.
Sorry for the long-winded post. I appreciate your help with all this. If you have other suggestions you think might be better ways to approach it I’m happy to hear those out as well.

I propose to use regular expressions in combination with the list of names from NLTK. Suppose your text is:
text = "I met Fred Jackson and Mary Jackson in Jackson Mississippi"
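Take a list of all names and convert it into a (huge) regular expression:
import re
import nltk  # the names corpus needs a one-time nltk.download('names')
first_names = "|".join(re.escape(w) for w in nltk.corpus.names.words())
jackson_names = re.compile(r"\b(?:" + first_names + r")\s+Jackson\b")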
In case you are not familiar with regular expressions, r'\s+' means "separated by one or more white spaces" and "|" means "or". The regular expression can be expanded to handle other last names.
Now, extract all "Jackson" matches from your text:
jackson_catch = jackson_names.findall(text)
#['Fred Jackson', 'Mary Jackson']
len(jackson_catch)
#2
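The same idea can be sketched without NLTK, which also shows the subtraction the question asks for. The short first-name list here is a hypothetical stand-in for the ~5000 census names, and the variable names are made up for illustration.

```python
import re

# Hypothetical stand-in for the census list of first names.
FIRST_NAMES = ["Fred", "Mary", "Frank", "Jane"]
LAST_NAME = "Jackson"

text = "I met Fred Jackson and Mary Jackson in Jackson Mississippi"

# One alternation group for the first names, followed by the fixed last name.
name_group = "|".join(re.escape(name) for name in FIRST_NAMES)
person_pattern = re.compile(r"\b(?:" + name_group + r")\s+" + re.escape(LAST_NAME) + r"\b")
# Every occurrence of the word, whether it is a person or a city.
city_pattern = re.compile(r"\b" + re.escape(LAST_NAME) + r"\b")

person_count = len(person_pattern.findall(text))
total_count = len(city_pattern.findall(text))
city_count = total_count - person_count
print(person_count, total_count, city_count)  # 2 3 1
```

Swapping `LAST_NAME` for another city/surname reuses the same first-name group, so the census list never has to be edited by hand.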

Let's start by assuming that you are able to work with your data by iterating through words, e.g.
s = 'Hello. I am a string.'
s.split()
Output: ['Hello.', 'I', 'am', 'a', 'string.']
and you have managed to normalize the words by eliminating punctuation, capitalization, etc.
So you have a list of words words_list (which is your text converted into a list) and an index i at which you think there might be a city name, OR it might be someone's last name falsely identified as a city name. Let's call your list of first names FIRST_NAMES, which should be of type set (see comments).
if i >= 1:
    prev_word = words_list[i-1]
    if prev_word in FIRST_NAMES:
        pass  # put false positive code here
    else:
        pass  # put true positive code here
You may also prefer to use regular expressions, as they are more flexible and more powerful. For example, you may notice that even after implementing this, you still have false positives or false negatives for some previously unforeseen reason. Regular expressions could allow you to adapt quickly to the new problem.
On the other hand, if performance is a primary concern, you may be better off not using something so powerful and flexible, so that you can hone your algorithm to fit your specific requirements and run as efficiently as possible.
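Put together, the word-by-word check might look like the sketch below. The first-name set and the helper names are hypothetical, and it only handles single-word city names; multi-word names such as "New York" would need extra logic.

```python
import string

# Hypothetical small set of first names; a set gives fast membership checks.
FIRST_NAMES = {"frank", "jane", "fred", "mary"}

def normalize(text):
    # Lowercase each whitespace-separated word and strip surrounding punctuation.
    words = (word.strip(string.punctuation).lower() for word in text.split())
    return [word for word in words if word]

def count_city_mentions(text, city):
    words_list = normalize(text)
    count = 0
    for i, word in enumerate(words_list):
        if word != city:
            continue
        # Skip the hit when the previous word is a known first name.
        if i >= 1 and words_list[i - 1] in FIRST_NAMES:
            continue
        count += 1
    return count

text = "Frank Jackson and Jane Jackson moved to Jackson, Mississippi."
print(count_city_mentions(text, "jackson"))  # 1
```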

The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say Jackson Mississippi, but not count “Frank Jackson” “Jane Jackson” etc…
The problem you have is called "named entity recognition", and is best solved with a classifier that takes multiple cues into account to find the named entities and classify them according to type (PERSON, ORGANIZATION, LOCATION, etc., or a similar list).
Chapter 7 in the nltk book, and especially section 3, Developing and evaluating chunkers, walks you through the process of building and training a recognizer. Alternatively, you could install the Stanford named-entity recognizer and measure its performance on your data.

python: How to use POS (part of speech) features in scikit-learn classifiers (SVM etc.)

I want to use the part-of-speech (POS) tags returned by nltk.pos_tag as features for an sklearn classifier. How can I convert them to a vector and use it?
e.g.
import nltk
sent = "This is POS example"
tok = nltk.tokenize.word_tokenize(sent)
pos = nltk.pos_tag(tok)
print(pos)
This returns the following:
[('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]
Now I am unable to apply any of the vectorizers (DictVectorizer, FeatureHasher, or CountVectorizer from scikit-learn) to this output for use in a classifier.
Please suggest an approach.

If I'm understanding you right, this is a bit tricky. Once you tag it, your sentence (or document, or whatever) is no longer composed of words, but of pairs (word + tag), and it's not clear how to make the most useful vector-of-scalars out of that.
Most text vectorizers do something like counting how many times each vocabulary item occurs, and then making a feature for each one:
the: 4, player: 1, bats: 1, well: 2, today: 3,...
The next document might have:
the: 0, quick:5, flying:3, bats:1, caught:1, bugs:2
Both can be stored as arrays of integers so long as you always put the same key in the same array element (you'll have a lot of zeros for most documents) -- or as a dict. So a vectorizer does that for many "documents", and then works on that.
So your question boils down to how to turn a list of pairs into a flat list of items that the vectorizers can count.
The most trivial way is to flatten your data to
('This', 'POS_DT', 'is', 'POS_VBZ', 'POS', 'POS_NNP', 'example', 'POS_NN')
The usual counting would then get a vector of 8 vocabulary items, each occurring once. I renamed the tags to make sure they can't get confused with words.
That would get you up and running, but it probably wouldn't accomplish much. That's because just knowing how many occurrences of each part of speech there are in a sample may not tell you what you need -- notice that any notion of which parts of speech go with which words is gone after the vectorizer does its counting.
Running a classifier on that may have some value if you're trying to distinguish something like style -- fiction may have more adjectives, lab reports may have fewer proper names (maybe), and so on.
Instead, you could change your data to
('This_DT', 'is_VBZ', 'POS_NNP', 'example_NN')
That keeps each tag "tied" to the word it belongs with, so now the vectors will be able to distinguish samples where "bat" is used as a verb from samples where it's only used as a noun. That would tell you slightly different things -- for example, "bat" as a verb is more likely in texts about baseball than in texts about zoos.
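Both arrangements can be sketched in a few lines. The tagged sentence is copied from the question; the helper names are made up, and `collections.Counter` stands in for the counting step a vectorizer would do.

```python
from collections import Counter

# Tagged sentence copied from the question's nltk.pos_tag output.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]

def flatten_separate(tagged_tokens):
    # Words and tags become independent items; tags get a POS_ prefix
    # so they cannot be confused with words.
    items = []
    for word, tag in tagged_tokens:
        items.extend([word, "POS_" + tag])
    return items

def flatten_joined(tagged_tokens):
    # Each word stays tied to its tag as one combined item.
    return [word + "_" + tag for word, tag in tagged_tokens]

print(Counter(flatten_separate(tagged)))
print(Counter(flatten_joined(tagged)))
```

Each `Counter` is a plain mapping from feature name to count, which is the shape scikit-learn's DictVectorizer accepts, so a list of them (one per document) can be turned into a feature matrix from there.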
And there are many other arrangements you could do.
To get good results from using vector methods on natural language text, you will likely need to put a lot of thought (and testing) into just what features you want the vectorizer to generate and use. It depends heavily on what you're trying to accomplish in the end.
Hope that helps.

I know this is a bit late, but gonna add an answer here.
Depending on what features you want, you'll need to encode the POS tags (POST) in a way that makes sense. I've had the best results with SVM classification using ngrams when I glue the original sentence to the POST sentence so that it looks like the following:
word1 word2 word3 ... wordn POST1 POST2 POST3... POSTn
Once this is done, I feed it into a standard ngram or whatever else and feed that into the SVM.
This method keeps the information of the individual words, but also keeps the vital information of POST patterns when you give your system a word it hasn't seen before but that the tagger has encountered before.
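A small sketch of the gluing step, reusing the tagged sentence from the question. The helper names are made up, and the n-gram function is a plain stand-in for whatever n-gram vectorizer feeds the SVM.

```python
def glue_words_and_tags(tagged_tokens):
    # Words first, then the tag sequence, as one whitespace-separated string.
    words = [word for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    return " ".join(words + tags)

def ngrams(tokens, n):
    # Contiguous n-grams as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tagged = [('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]
glued = glue_words_and_tags(tagged)
print(glued)  # This is POS example DT VBZ NNP NN
print(ngrams(glued.split(), 2))
```

Note that one bigram, ('example', 'DT'), straddles the boundary between the words and the tags; whether that matters probably depends on the classifier and the amount of data.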

What about merging each word and its tag as 'word/tag'? You can then feed the new corpus to a vectorizer that counts the tokens (TF-IDF or bag of words) and makes a feature for each one:
import nltk
wpt = nltk.WordPunctTokenizer()
text = wpt.tokenize('Someone should have this ring to a volcano')
text_tagged = nltk.pos_tag(text)
new_text = []
for word in text_tagged:
    new_text.append(word[0] + "/" + word[1])
doc = ' '.join(new_text)
The output for this is:
Someone/NN should/MD have/VB this/DT ring/NN to/TO a/DT volcano/NN

I think a better method would be to:
Step-1: Create word/sentence embeddings for each text/sentence.
Step-2: Calculate the POS tags. Feed the POS tags to an embedder, as in Step-1.
Step-3: Elementwise multiply the two vectors. (This ensures that the word embeddings in each sentence are weighted by the POS tags associated with them.)
Thanks
