Analyze the semantic generality of a sentence with Python

I'm looking to analyze how specific a statement is. I've checked out packages like NLTK but haven't found anything that seems to fit. I'm looking for something that can give an English sentence a score of how specific or general it is.
An example of a specific sentence:
"The box is green and weighed one pound last week."
An example of a general sentence:
"Red is a color."
Any suggestions or ideas?

Related

NLTK - distinguishing between colors and words using context

I'm writing a program to analyze the usage of color in text. I want to search for color words such as "apricot" or "orange". For example, an author might write "the apricot sundress billowed in the wind." However, I want to only count the apricots/oranges that actually describe color, not something like "I ate an apricot" or "I drank orange juice."
Is there any way to do this, perhaps using context() in NLTK?
Welcome to the broad field of homonymy, polysemy and WSD (word sense disambiguation). In corpus linguistics, one approach uses collocations (such as "orange juice" and "orange jacket") to estimate a probability of the juice having the colour "orange" or being made of the respective fruit. Both probabilities are high, but the probability of a "jacket" being made of the respective fruit should be much lower. There are different methods you can use. You could ask corpus annotators (specialists, crowdsourcing, etc.) to annotate occurrences in a text, and use that data to train a (machine learning) model, in this case a simple classifier. Alternatively, you could use large text data to gather collocation counts in combination with WordNet, which may give you semantic information about whether it is usual for a jacket to be made of fruit. A fortunate detail is that people only rarely spell out stereotypical colours in text, so you don't have to care about cases like "the yellow banana".
Shallow parsing may also help, since colour adjectives are preferably used in attributive position (directly before the noun they modify).
A different approach would be to use word similarity measures (vector space semantics) or embeddings for word sense disambiguation (WSD).
Maybe this helps:
https://web.stanford.edu/~jurafsky/slp3/slides/Chapter18.wsd.pdf
https://towardsdatascience.com/a-simple-word-sense-disambiguation-application-3ca645c56357
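As a toy illustration of the attributive-position heuristic, here is a minimal sketch. The hand-made noun set is an assumption standing in for a real POS tagger or chunker, and the example sentences come from the question:

```python
# Toy illustration of the attributive-position heuristic: count a colour
# word only when it directly precedes a noun. The NOUNS set is a hand-made
# stand-in for a real POS tagger or shallow parser.
COLOUR_WORDS = {"orange", "apricot", "red", "yellow"}
NOUNS = {"sundress", "juice", "jacket", "wind"}

def colour_uses(tokens):
    """Return (colour word, noun) pairs where the colour word is attributive."""
    return [
        (w, tokens[i + 1])
        for i, w in enumerate(tokens[:-1])
        if w in COLOUR_WORDS and tokens[i + 1] in NOUNS
    ]

print(colour_uses("the apricot sundress billowed in the wind".split()))
# -> [('apricot', 'sundress')]
print(colour_uses("i ate an apricot".split()))
# -> []  ("apricot" is a noun here, so it is correctly skipped)
print(colour_uses("i drank orange juice".split()))
# -> [('orange', 'juice')]  a false positive
```

Note how the last case still slips through: "orange" is attributive in "orange juice" yet not a colour use, which is exactly where the collocation counts or WSD described above are still needed.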

Find similar/synonyms/context words Python

Hello, I'm looking for a solution to my issue:
I want to find a list of similar words in French and English.
For example :
name could be : first name, last name, nom, prénom, username....
Postal address could be : city, country, street, ville, pays, code postale ....
The other answer, and comments, describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using nltk and wordnet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in Wordnet, follow all its relations, but also go up to the hypernym, and do the same there.
Finding the start word is not always easy:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages. http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code, and move to the French WordNet with it, or vice versa.
The other approach is to use word embeddings. You find the, say, 300-dimensional vector of your source word, and then hunt for the nearest words in that vector space. This will return words that are used in similar contexts, so they could be similar in meaning, or just used similarly syntactically.
Spacy has a good implementation, see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how similar two words are, not towards finding similar words. To find similar words you need to take your vector and compare it with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline, and build your own "synonyms-and-similar" dictionary for each word of interest.
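That offline nearest-neighbour scan can be sketched in a few lines. The tiny four-dimensional vector table below is entirely invented for illustration; in practice the vectors would come from spaCy or a pretrained embedding file:

```python
import numpy as np

# Invented toy "embeddings"; real vectors would be ~300-dimensional and
# loaded from spaCy or a pretrained file.
vocab = {
    "name":     np.array([0.9, 0.1, 0.0, 0.2]),
    "username": np.array([0.8, 0.2, 0.1, 0.3]),
    "prénom":   np.array([0.85, 0.15, 0.05, 0.25]),
    "city":     np.array([0.1, 0.9, 0.8, 0.0]),
    "street":   np.array([0.2, 0.8, 0.9, 0.1]),
}

def nearest(word, k=2):
    """Cosine-similarity scan over the whole vocabulary: O(N) per query."""
    v = vocab[word]
    sims = {
        w: float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        for w, u in vocab.items() if w != word
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("name"))
# -> ['prénom', 'username'] with these toy vectors
```

With a real vocabulary you would precompute normalized vectors (or use an approximate-nearest-neighbour index) and store the top-k lists per word, since this linear scan is paid on every query.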
from PyDictionary import PyDictionary

dictionary = PyDictionary()
answer = dictionary.synonym(word)
word is the word for which you are finding the synonyms.

Which technique is most appropriate for identifying various sentiments in the same text using Python?

I'm studying NLP and, as an example, I'm trying to identify the feelings expressed in customer feedback on an online course platform.
I was able to identify the students' feelings in simple sentences, such as "The course is very nice, I learned a lot from it", "The teaching platform is complete and I really enjoy using it", "I could have more courses related to marine biology", and so on.
My doubt is how to correctly identify the various sentiments in one sentence or in several sentences. For example:
A sentiment per sentence:
"The course is very good! it could be cool to create a section of questions on the site."
More than one sentiment per sentence:
"The course is very good, but the site is not."
Involving both:
"The course is very good, but the teaching platform is very slow. There could be more tasks and examples in the courses, interaction by video or microphone on the forum, for example."
I thought of splitting the text into sentences, but that does not work well for example 2.
You can treat commas, other punctuation marks, and some conjunctions and prepositions as splitting sentences. This actually goes beyond code into the field of linguistics, as they sometimes, but not always, separate clauses.
In the 2nd case you actually have two sentences: "The course is very good" -, but- "The site is not [very good]".
I believe there are NLP packages that can split clauses (probably by relying on the fact that most sentences follow a subject/predicate/object structure, so if you find more than one verb you'll probably find the same number of clauses), and you could use those to parse your text first. Look for libraries that do this for your language of choice.
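As a rough sketch of that clause splitting, a regex over punctuation and a few coordinating conjunctions already handles the examples in the question. The delimiter list is a guess, not a real grammar; a parser would do this properly:

```python
import re

# Split on sentence-final punctuation, commas, and a few coordinating
# conjunctions. This delimiter list is a rough guess, not a real grammar.
CLAUSE_SPLIT = re.compile(r"[.!?;]|,\s*|\b(?:but|and|however)\b", re.I)

def clauses(text):
    """Return non-empty, stripped clause candidates from a feedback text."""
    return [c.strip() for c in CLAUSE_SPLIT.split(text) if c and c.strip()]

print(clauses("The course is very good, but the site is not."))
# -> ['The course is very good', 'the site is not']
```

Each resulting clause can then be scored by the sentiment classifier independently, which covers the "more than one sentiment per sentence" case.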
There is a lib specific for multi-label classification:
scikit-multilearn
When you train your model, you have to split the classes into binary columns.
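That "binary columns" step looks like this with plain scikit-learn (scikit-multilearn builds on the same label representation). The tiny training set and the label names are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy training data: each feedback text carries a *set* of labels.
texts = [
    "The course is very good",
    "the site is slow and bad",
    "The course is great but the site is not",
    "more tasks and examples please",
]
labels = [{"course+"}, {"site-"}, {"course+", "site-"}, {"suggestion"}]

# Turn the label sets into binary columns, one column per class.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)   # shape (4, 3), one binary column per label

# One binary classifier per column; any base classifier works here.
clf = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
clf.fit(texts, Y)

pred = clf.predict(["The course is very good but the site is slow"])
print(mlb.inverse_transform(pred))
```

With four invented examples the prediction itself is anecdotal; the point is the shape of the problem: one binary column per sentiment, so a single text can receive several labels at once.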

pronoun resolution backwards

The usual coreference resolution works in the following way:
Provided
The man likes math. He really does.
it figures out that
he
refers to
the man.
There are plenty of tools to do this.
However, is there a way to do it backwards?
For example,
given
The man likes math. The man really does.
I want to do the pronoun resolution "backwards,"
so that I get an output like
The man likes math. He really does.
My input text will mostly be 3~10 sentences, and I'm working with python.
This is perhaps not really an answer to be happy with, but I think the answer is that there's no such functionality built in anywhere, though you can code it yourself without too much difficulty. Giving an outline of how I'd do it with CoreNLP:
Still run coref. This'll tell you that "the man" and "the man" are coreferent, and so you can replace the second one with a pronoun.
Run the gender annotator from CoreNLP. This is a poorly-documented and even more poorly advertised annotator that tries to attach gender to tokens in a sentence.
Somehow figure out plurals. Most of the time you could use the part-of-speech tag: plural nouns get the tags NNS or NNPS, but there are some complications so you might also want to consider (1) the existence of conjunctions in the antecedent; (2) the lemma of a word being different from its text; (3) especially in conjunction with 2, the word ending in 's' or 'es' -- this can distinguish between lemmatizations which strip out plurals versus lemmatizations which strip out tenses, etc.
This is enough to figure out the right pronoun. Now it's just a matter of chopping up the sentence and putting it back together. This is a bit of a pain if you do it in CoreNLP -- the code is just not set up to change the text of a sentence -- but in the worst case you can always just re-annotate a new surface form.
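Ignoring the CoreNLP plumbing, the final substitution step might look like this toy sketch. The coref mention, the gender, and the number are assumed inputs here (in a real pipeline they would come from the coref, gender, and POS annotators described above), and the pronoun table is hand-made:

```python
import re

# Hand-made pronoun table: (gender, number) -> pronoun. In a real system
# gender would come from CoreNLP's gender annotator and number from POS tags.
PRONOUNS = {("male", "sg"): "he", ("female", "sg"): "she",
            ("neuter", "sg"): "it", (None, "pl"): "they"}

def pronominalize(text, mention, gender, number):
    """Keep the first occurrence of `mention`, replace later ones with a pronoun."""
    pronoun = PRONOUNS[(gender, number)]
    first_end = text.find(mention) + len(mention)
    head, tail = text[:first_end], text[first_end:]

    def repl(match):
        # Capitalize the pronoun when the replaced mention started a sentence.
        before = tail[:match.start()].rstrip()
        return pronoun.capitalize() if before.endswith((".", "!", "?")) else pronoun

    return head + re.sub(re.escape(mention), repl, tail)

print(pronominalize("The man likes math. The man really does.",
                    "The man", "male", "sg"))
# -> The man likes math. He really does.
```

A real version would iterate over the coref chains, handle case-insensitive matching, and decide per chain whether pronominalization would introduce ambiguity (two competing "he" antecedents, for instance).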
Hope this helps somewhat!

Phrase corpus for sentiment analysis

Good day,
I'm attempting to write a sentiment analysis application in Python (using a naive Bayes classifier), with the aim of categorizing phrases from news as positive or negative.
And I'm having a bit of trouble finding an appropriate corpus for that.
I tried using "General Inquirer" (http://www.wjh.harvard.edu/~inquirer/homecat.htm) which works OK but I have one big problem there.
Since it is a word list, not a phrase list, I observe the following problem when trying to label this sentence:
He is not expected to win.
This sentence is categorized as positive, which is wrong. The reason is that "win" is positive, while "not" carries no sentiment on its own; the negation is only visible at the phrase level ("not win").
Can anyone suggest either a corpus or a work around for that issue?
Your help and insight are greatly appreciated.
See for example: "What's great and what's not: learning to classify the scope of negation for improved sentiment analysis" by Councill, McDonald, and Velikovich
http://dl.acm.org/citation.cfm?id=1858959.1858969
and followups,
http://scholar.google.com/scholar?cites=3029019835762139237&as_sdt=5,33&sciodt=0,33&hl=en
e.g. by Morante et al 2011
http://eprints.pascal-network.org/archive/00007634/
In this case, the word "not" modifies the meaning of the phrase "expected to win", reversing it. To identify this, you would need to POS-tag the sentence and apply the negative adverb "not" to the (I think) verb phrase as a negation. I don't know if there is a corpus that would tell you whether "not" is this type of modifier, however.
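One common, simpler workaround is to mark every token after a negation word until the next punctuation, so the word list sees a distinct token like NOT_win instead of win. A minimal sketch (the negation word set and the punctuation-bounded scope are simplifying assumptions):

```python
import re

# Simplified negation cues; real lists are longer.
NEGATIONS = {"not", "no", "never", "n't"}

def mark_negation(tokens):
    """Prefix tokens inside a negation's scope with NOT_ until punctuation."""
    out, negated = [], False
    for tok in tokens:
        if tok.lower() in NEGATIONS:
            negated = True
            out.append(tok)
        elif re.fullmatch(r"[.,;:!?]", tok):
            negated = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out

print(mark_negation("He is not expected to win .".split()))
# -> ['He', 'is', 'not', 'NOT_expected', 'NOT_to', 'NOT_win', '.']
```

The classifier then learns separate weights for win and NOT_win, which fixes exactly the example above. NLTK ships a similar helper, nltk.sentiment.util.mark_negation, if you'd rather not roll your own.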
