I am trying to extract special chunks of POS tags like this and many chunks from
different patterns are working good and similar sentences can be found using them.But the problem rises when I see the exact sequence of tags I have defined in tagged words as output as chunk but the machine could not find them with the name I have defined.Example:
{<VB><RB.?><VB><NN.?>+<IN>*<JJ.?>*<NN.?>*}
This would easily find sentence of :
Do not take money from internal relations
But when I have another pattern:
{<IN><DT>*<NN.?>+<VBZ><RB.?>*<JJ.?><CC>*<PRP$><NN.?>+<VBZ><JJ.?><TO><VB><CC>*<VBG><PRP><MD><VB>}
for the example:
if the present is not easy, or its size is difficult to quantify, but declining it would satisfy
It is not possible to detect it and shows it as a S only.Although I believe the pattern is exactly the same.Can this be because the clause I am looking for is sometimes in the beginning, sometimes in the middle and sometimes in the end of sentence?Can this be because I use I use PunktSentenceTokenizer?
Any help would be appreciated
Related
I just started learning how NLP works. What I can do right now is to get the number of frequency of a specific word per document. But what I'm trying to do is to compare the four documents that I have to compare their similarities and different as well as displaying the words that are similar and the words that is unique to each document.
My documents are in .csv format imported using pandas. As each row has their own sentiment.
To be honest, the question you're asking is very high level and difficult (maybe impossible) to answer on a forum like this. So here are some ideas that might be helpful:
You could try to use [term frequency–inverse document frequency (TFIDF)] (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to compare the vocabularies for similarities and differences. This is not a large step from your current word-frequency analysis.
For a more detailed analysis, it might be a good idea to substitute the words of your documents with something like wordnet's synsets. This makes it possible to compare the sentence meanings at a higher level of abstraction than the actual words themselves. For example, if each of your documents mentions "planes", "trains", and "automobiles", there is an underlying similarity (vehicle references) that a simple word comparison will ignore not be able to detect.
my goal is very simple: I have a set of strings or a sentence and I want to find the most similar one within a text corpus.
For example I have the following text corpus: "The front of the library is adorned with the Word of Life mural designed by artist Millard Sheets."
And I'd like to find the substring of the original corpus which is most similar to: "the library facade is painted"
So what I should get as output is: "fhe front of the library is adorned"
The only thing I came up with is to split the original sentence in substrings of variable lengths (eg. in substrings of 3,4,5 strings) and then use something like string.similarity(substring) from the spacy python module to assess the similarities of my target text with all the substrings and then keep the one with the highest value.
It seems a pretty inefficient method. Is there anything better I can do?
It probably works to some degree, but I wouldn't expect the spacy similarity method (averaging word vectors) to work particularly well.
The task you're working on is related to paraphrase detection/identification and semantic textual similarity and there is a lot of existing work. It is frequently used for things like plagiarism detection and the evaluation of machine translation systems, so you might find more approaches by looking in those areas, too.
If you want something that works fairly quickly out of the box for English, one suggestion is terp, which was developed for MT evaluation but shown to work well for paraphrase detection:
https://github.com/snover/terp
Most methods are set up to compare two sentences, so this doesn't address your potential partial sentence matches. Maybe it would make sense to find the most similar sentence and then look for substrings within that sentence that match better than the sentence as a whole?
I want to pull abstracts out of a large corpus of scientific papers using a python script. The papers are all saved as strings in a large csv. I want to something like this: extracting text between two headers I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase and span two lines. They are usually followed by one-two newlines. This is what I came up with: -->
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However it could be 'ABSTRACT' and 'INTRODUCTION', or ABSTRACT and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'
Help?
Recognizing the next section is a bit vague - perhaps we can rely on Abstract-section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section-title will start with an uppercase letter and be followed by any number of word-characters. (Also that's rather vague, too, and assumes there'l be no \n\n within the Abstract.
ABSTRACT\n(.*)\n\n\U[\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can stepwise refine it.
N.B: as Wiktor pointed out, I could not use the case-insensitive modifiers. So the whole rx should be used with switches for case-insenstive matching.
Update1: the challenge here is really how to identify that a new section has begun...and not to confuse that with paragraph-breaks within the Abstract. Perhaps that can also be dealt with by changing the rather tolerant [\w\s]*with [\w\s]{1,100} which would only recognize text in a new paragraph as a title of the "abstract-successor" if it had between 2 and 100 characters (note: 2 characters, although the limit is set to 1 because of the \U (uppercase character).
ABSTRACT\n(.*)\n\n\U[\w\s]{1,100}\n\n
I have a collection of 40-50 text files that contain markdown. Some of them contain duplicate words, sentences, and paragraphs. I'm looking for a script/algorithm to scan the files and help me identify matches (or near matches). Where can I find such a thing? Searching for this type of thing online yielded results for other types of problems, but not this one. Would appreciate any clues to help me narrow my search...
basically, a simple brute forces can solve all of your problems. But you should consider another algorithms depend on your requirement (timing, memory,...): Boyer–Moore, Rabin–Karp string search algorithm, Knuth–Morris–Pratt algorithm.
I'm currently working on a project, where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), I can't however simply prefix words in a sentence that contains a negation-word, as those words would simply not show up in conceptnet5's API.
Here's an example:
The movie wasn't that good.
Hence, I figured that I could use wordnet's lemma functionality to replace adjectives in sentences that contain negation-words like (not, ...).
In the previous example, the algorithm would detect wasn't and would replace it with was not.
Further, it would detect a negation-word not, and replace good with it's antonym bad.
The sentence would read:
The movie was that bad.
While I see that this isn't the most elegant way, and it does probably in many cases produce the wrong result, I'd still like to handle negation that way as I frankly don't know any better approach.
Considering my problem:
Unfortunately, I did not find any library that would allow me to replace all occurrences of appended negation-words (wasn't => was not).
I mean I could do it manually, by replacing the occurrences with a regex, but then I would be stuck with the english language.
Therefore I'd like to ask if some of you know a library, function or better method that could help me here.
Currently I'm using python nltk, still it doesn't seem that it contains such functionality, but I may be wrong.
Thanks in advance :)
Cases like wasn't can be simply parsed by tokenization (tokens = nltk.word_tokenize(sentence)): wasn't will turn into was and n't.
But negative meaning can also be formed by 'Quasi negative words, like hardly, barely, seldom' and 'Implied negatives, such as fail, prevent, reluctant, deny, absent', look into this paper. Even more detailed analysis can be found in Christopher Potts' On the negativity of negation
.
Considering your initial problem, sentiment analysis, most modern approaches, as far as I know, don't process negations explicitly; instead, they use supervised approaches with high-order n-grams. Those actually processing negation usually append special prefix NOT_ to all words between negation and punctuation marks.