get unmatched words after CountVectorizer transform - python

I am using CountVectorizer to apply string matching on a large dataset of texts. What I want is to get the words that do not match any term in the resulting matrix. For example, if the resulting terms (features) after fitting are:
{'hello world', 'world and', 'and stackoverflow', 'hello', 'world', 'stackoverflow', 'and'}
and I transformed this text:
"oh hello world and stackoverflow this is a great morning"
I want to get the string "oh this is a great morning", since it matches nothing in the features. Is there an efficient method to do this?
I tried using the inverse_transform method to get the matched features and remove them from the text, but I ran into many problems and long running times.

Transforming a piece of text on the basis of a fitted vocabulary returns a matrix with counts of the known vocabulary items.
For example, fitting on the document from your example:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(1, 2))
docs = ['hello world and stackoverflow']
vec.fit(docs)
Then the fitted vocabulary might look as follows:
In [522]: print(vec.vocabulary_)
{'hello': 2,
'world': 5,
'and': 0,
'stackoverflow': 4,
'hello world': 3,
'world and': 6,
'and stackoverflow': 1}
Which represents a token to index mapping. Transforming some new documents subsequently returns a matrix with counts of all known vocabulary tokens. Words that are not in the vocabulary are ignored!
other_docs = ['hello stackoverflow',
'hello and hello',
'oh hello world and stackoverflow this is a great morning']
X = vec.transform(other_docs)
In [523]: print(X.A)
[[0 0 1 0 1 0 0]
[1 0 2 0 0 0 0]
[1 1 1 1 1 1 1]]
Your vocabulary consists of 7 items, hence the matrix X contains 7 columns. And we've transformed 3 documents, so it's a 3x7 matrix. The elements of the matrix are counts of how often each vocabulary item occurred in the document. For example, for the second document "hello and hello", we have a count of 2 in column 2 (0-indexed) and a count of 1 in column 0, which refer to "hello" and "and", respectively.
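If you want to check a single entry programmatically rather than reading it off the printed matrix, you can look the column index up in the fitted vocabulary (a small sketch reusing vec and X from above):
col = vec.vocabulary_['hello']   # column index of the token 'hello', here 2
print(X[1, col])                 # count of 'hello' in the second document -> 2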
Inverse transform then maps a row of feature counts back to the corresponding vocabulary items:
In [534]: print(vec.inverse_transform([1, 2, 3, 4, 5, 6, 7]))
[array(['and', 'and stackoverflow', 'hello', 'hello world',
'stackoverflow', 'world', 'world and'], dtype='<U17')]
Note: the items are returned in the order of their vocabulary indices as printed above.
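For instance, applied to the matrix X transformed above, inverse_transform gives you, per document, the vocabulary items that actually occurred in it (a small sketch; the expected output is indicated in the comments):
for doc, present in zip(other_docs, vec.inverse_transform(X)):
    print(doc, '->', present)
# hello stackoverflow -> ['hello' 'stackoverflow']
# hello and hello -> ['and' 'hello']
# oh hello world ... great morning -> all seven vocabulary items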
Now let's get to your actual question, which is identifying all out-of-vocabulary (OOV) items in a given input document. It's fairly straightforward using sets if you're only interested in unigrams:
tokens = 'oh hello world and stackoverflow this is a great morning'.split()
In [542]: print(set(tokens) - set(vec.vocabulary_.keys()))
{'morning', 'a', 'is', 'this', 'oh', 'great'}
Things are slightly more involved if you're also interested in bigrams (or any other n-gram with n > 1), as you first need to generate all bigrams from your input document (note that there are various ways to generate all n-grams from an input document, of which the following is just one):
bigrams = list(map(lambda x: ' '.join(x), zip(tokens, tokens[1:])))
In [546]: print(bigrams)
['oh hello', 'hello world', 'world and', 'and stackoverflow', 'stackoverflow this', 'this is', 'is a', 'a great', 'great morning']
This line looks fancy, but all it does is zip two lists together (the second list starting at the second item), which yields tuples such as ('oh', 'hello'); the map call then joins each tuple with a single space, turning ('oh', 'hello') into 'oh hello', and the map generator is finally converted into a list. Now you can build the union of unigrams and bigrams:
doc_vocab = set(tokens) | set(bigrams)
In [549]: print(doc_vocab)
{'and stackoverflow', 'hello', 'a', 'morning', 'hello world', 'great morning', 'world', 'stackoverflow', 'stackoverflow this', 'is', 'world and', 'oh hello', 'oh', 'this', 'is a', 'this is', 'and', 'a great', 'great'}
Now you can do the same as with unigrams above to retrieve all OOV items:
In [550]: print(doc_vocab - set(vec.vocabulary_.keys()))
{'morning', 'a', 'great morning', 'stackoverflow this', 'is a', 'is', 'oh hello', 'this', 'this is', 'oh', 'a great', 'great'}
This now represents all unigrams and bigrams that are not in your vectorizer's vocabulary.
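If you don't want to hand-roll the n-gram generation for other ngram_range settings, you can reuse the vectorizer's own analyzer, which produces exactly the unigrams and bigrams the vocabulary was built from. A minimal sketch (oov_items is just an illustrative helper name):
analyzer = vec.build_analyzer()   # same preprocessing, tokenization and n-gram generation as the vectorizer

def oov_items(doc):
    # all n-grams of the document that are missing from the fitted vocabulary
    return set(analyzer(doc)) - set(vec.vocabulary_)

print(oov_items('oh hello world and stackoverflow this is a great morning'))
Note that the default analyzer lowercases the text and drops single-character tokens such as 'a', so the result can differ slightly from the manual split()-based approach above.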

Related

What is the most pythonic way to split a string into contiguous, overlapping list of words

Say I had a sentence "The cat ate the mouse." I want to split the sentence with size = 2.
So the result array becomes:
["the cat", "cat ate", "ate the", "the mouse"]
If my size was 3, it should become:
["the cat ate", "cat ate the", "ate the mouse"]
The method I have right now uses tons of for loops, and I'm not sure whether there is a better way.
Using a list slice, you can get a sub-list:
>>> words = "The cat ate the mouse.".rstrip('.').split()
>>> words[0:3]
['The', 'cat', 'ate']
Use str.join to convert the list to a string joined by a delimiter:
>>> ' '.join(words[0:3])
'The cat ate'
A list comprehension provides a concise way to create the list of word groups:
>>> n = 2
>>> [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
['The cat', 'cat ate', 'ate the', 'the mouse']
>>> n = 3
>>> [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
['The cat ate', 'cat ate the', 'ate the mouse']
# [' '.join(words[0:3]), ' '.join(words[1:4]),...]
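Wrapped up as a small reusable helper (word_ngrams is just an illustrative name, not from the original answer):
def word_ngrams(sentence, n):
    # contiguous, overlapping groups of n words
    words = sentence.rstrip('.').split()
    return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

print(word_ngrams("The cat ate the mouse.", 2))   # ['The cat', 'cat ate', 'ate the', 'the mouse']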
You can use the nltk library to do all the work:
import nltk
from nltk.util import ngrams
text = "The cat ate the mouse."
tokens = nltk.word_tokenize(text)
trigrams = ngrams(tokens, 3)
for gram in trigrams:
    print(gram)
which gives us:
('The', 'cat', 'ate')
('cat', 'ate', 'the')
('ate', 'the', 'mouse')
('the', 'mouse', '.')
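Since the question asks for joined strings rather than tuples, you can join each tuple with a space; with n = 2 this matches the asker's desired output, apart from the trailing 'mouse .' entry that appears because word_tokenize keeps the period as its own token:
bigrams = [' '.join(gram) for gram in ngrams(tokens, 2)]
print(bigrams)   # ['The cat', 'cat ate', 'ate the', 'the mouse', 'mouse .']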

How to vectorize a list of words python?

I am trying to use the CountVectorizer module with Sci-kit Learn. From what I read, it seems like it can be used on a list of sentences, like:
['This is the first document.','This is the second second document.','And the third one.', 'Is this the first document?']
However, is there a way to vectorize a collection of words in list form, such as [['this', 'is', 'text', 'document', 'to', 'analyze'], ['and', 'this', 'is', 'the', 'second'], ['and', 'this', 'and', 'that', 'are', 'third']]?
I am trying to convert each list to a sentence using ' '.join(wordList), but I am getting an error:
TypeError: sequence item 13329: expected string or Unicode, generator found
when I try to run:
vectorizer = CountVectorizer(min_df=50)
ratings = vectorizer.fit_transform([' '.join(wordList)])
thanks!
I guess you need to do this:
counts = vectorizer.fit_transform(wordList) # sparse matrix with columns corresponding to words
words = vectorizer.get_feature_names() # array with words corresponding to columns
Finally, to recover the words present in a given sample, e.g. the first document ['this', 'is', 'text', 'document', 'to', 'analyze'] (note that they come back in vocabulary order, not in their original order):
sample_idx = 0
sample_words = [words[i] for i, count in enumerate(counts.toarray()[sample_idx]) if count > 0]
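If your input really is a list of token lists (as in the question), there are two simple ways to feed it to CountVectorizer: join each inner list back into a string, or hand the vectorizer a pass-through analyzer so it skips its own tokenization. A minimal sketch (docs_tokenized is an illustrative name); the TypeError in the question indicates that item 13329 of the original list was a generator rather than a sequence of strings, so that item would need to be materialized first:
from sklearn.feature_extraction.text import CountVectorizer

docs_tokenized = [['this', 'is', 'text', 'document', 'to', 'analyze'],
                  ['and', 'this', 'is', 'the', 'second'],
                  ['and', 'this', 'and', 'that', 'are', 'third']]

# Option 1: join each token list back into a whitespace-separated string
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(' '.join(doc) for doc in docs_tokenized)

# Option 2: keep the token lists and bypass the built-in tokenization
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
counts = vectorizer.fit_transform(docs_tokenized)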

Generate bigrams with NLTK

I am trying to produce a bigram list for a given sentence. For example, if I type:
To be or not to be
I want the program to generate
to be, be or, or not, not to, to be
I tried the following code, but it just gives me:
<generator object bigrams at 0x0000000009231360>
This is my code:
import nltk
bigrm = nltk.bigrams(text)
print(bigrm)
So how do I get what I want? I want a list of combinations of the words like above (to be, be or, or not, not to, to be).
nltk.bigrams() returns an iterator (a generator, specifically) of bigrams. If you want a list, pass the iterator to list(). It also expects a sequence of items to generate bigrams from, so you have to split the text before passing it (if you have not already done so):
bigrm = list(nltk.bigrams(text.split()))
To print them out separated by commas, you could (in Python 3):
print(*map(' '.join, bigrm), sep=', ')
If on Python 2, then for example:
print ', '.join(' '.join((a, b)) for a, b in bigrm)
Note that just for printing you do not need to generate a list, just use the iterator.
The following code produces a bigram list for a given sentence:
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = "to be or not to be"
>>> tokens = nltk.word_tokenize(text)
>>> bigrm = nltk.bigrams(tokens)
>>> print(*map(' '.join, bigrm), sep=', ')
to be, be or, or not, not to, to be
Quite late, but this is another way.
>>> from nltk.util import ngrams
>>> text = "I am batman and I like coffee"
>>> _1gram = text.split(" ")
>>> _2gram = [' '.join(e) for e in ngrams(_1gram, 2)]
>>> _3gram = [' '.join(e) for e in ngrams(_1gram, 3)]
>>>
>>> _1gram
['I', 'am', 'batman', 'and', 'I', 'like', 'coffee']
>>> _2gram
['I am', 'am batman', 'batman and', 'and I', 'I like', 'like coffee']
>>> _3gram
['I am batman', 'am batman and', 'batman and I', 'and I like', 'I like coffee']

Dictionary and position list back to sentence

I've managed to get my program to store a sentence or two into a dictionary and at the same time create a word position list.
What I need to do now is recreate the original sentence just from the dictionary and the position list. I've done lots of searches, but the results I'm getting are either not what I need or too confusing and beyond me.
Any help would be much appreciated, thanks.
Here is my code so far:
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
print ('This is the sentence with spaces before the punctuations:', sentence)
words_list = sentence.split()
print ('A list of the words in the sentence:', words_list)
dictionary = {}
word_pos_list = []
counter = 0
for word in words_list:
    if word not in dictionary:
        counter += 1
        dictionary[word] = counter
    word_pos_list.append(dictionary[word])
print ('The positions of the words in the sentence are:', word_pos_list)
John
While, as mentioned in the comments, dictionaries are not ordered data structures, if you are breaking up a sentence, indexing it into a dictionary, and trying to put it back together, you can use an OrderedDict from the collections library to do what you're doing.
That said, this is without any further background or knowledge of how you are splitting your sentence (punctuation etc.; I suggest looking into NLTK if you are doing any sort of natural language processing (NLP)).
from collections import OrderedDict
In [182]: def index_sentence(s):
   .....:     return {s.split(' ').index(i): i for i in s.split(' ')}
   .....:
In [183]: def build_sentence_from_dict(d):
   .....:     return ' '.join(OrderedDict(d).values())
   .....:
In [184]: s
Out[184]: 'See spot jump over the brown fox.'
In [185]: id = index_sentence(s)
In [186]: id
Out[186]: {0: 'See', 1: 'spot', 2: 'jump', 3: 'over', 4: 'the', 5: 'brown', 6: 'fox.'}
In [187]: build_sentence_from_dict(id)
Out[187]: 'See spot jump over the brown fox.'
To reconstruct from your list you have to reverse the location mapping:
# reconstruct
reversed_dictionary = {x:y for y, x in dictionary.items()}
print(' '.join(reversed_dictionary[x] for x in word_pos_list))
This can be done more nicely using a defaultdict (a dictionary with a predefined default value, in your case a list of locations for each word):
#!/usr/bin/env python3.4
from collections import defaultdict
# preprocessing
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
punctuation = '()?:;,.!/"\''
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
# using defaultdict this time
word_to_locations = defaultdict(list)
for position, word in enumerate(sentence.split()):
    word_to_locations[word].append(position)
# word -> list of locations
print(word_to_locations)
# location -> word
location_to_word = dict((y, x) for x in word_to_locations for y in word_to_locations[x])
print(location_to_word)
# reconstruct
print(' '.join(location_to_word[x] for x in range(len(location_to_word))))
It's not the randomness of dictionary keys that's the problem here, it's the failure to record every position at which a word was seen, duplicate or not. The following does that and then unwinds the dictionary to produce the original sentence, sans punctuation:
from collections import defaultdict
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print ('This is the sentence:', sentence)
punctuation = set('()?:;\\,.!/"\'')
sentence = ''.join(character for character in sentence if character not in punctuation)
print ('This is the sentence with no punctuation:', sentence)
words = sentence.split()
print('A list of the words in the sentence:', words)
dictionary = defaultdict(list)
last_word_position = 0
for word in words:
    last_word_position += 1
    dictionary[word].append(last_word_position)
print('A list of unique words in the sentence and their positions:', dictionary.items())
# Now the tricky bit to unwind our random dictionary:
sentence = []
for position in range(1, last_word_position + 1):
    sentence.extend([word for word, positions in dictionary.items() if position in positions])
print(*sentence)
The output of the various print() statements:
This is the sentence: This Sentence is a very, very good sentence. Did you like my very good sentence?
This is the sentence with no punctuation: This Sentence is a very very good sentence Did you like my very good sentence
A list of the words in the sentence: ['This', 'Sentence', 'is', 'a', 'very', 'very', 'good', 'sentence', 'Did', 'you', 'like', 'my', 'very', 'good', 'sentence']
A list of unique words in the sentence and their positions: dict_items([('Sentence', [2]), ('is', [3]), ('a', [4]), ('very', [5, 6, 13]), ('This', [1]), ('my', [12]), ('Did', [9]), ('good', [7, 14]), ('you', [10]), ('sentence', [8, 15]), ('like', [11])])
This Sentence is a very very good sentence Did you like my very good sentence

Determining the position of sub-string in list of strings

I have a list of words (strings), say:
word_lst = ['This','is','a','great','programming','language']
And a second list with sub-strings, say:
subs_lst= ['This is', 'language', 'a great']
And let's suppose each sub-string in subs_lst appears only one time in word_lst. (sub-strings can be of any length)
I want an easy way to find the hierarchical position of the sub-strings in the word_lst.
So what I want is to order subs_lst according to their appearance in word_lst.
In the previous example, the output would be:
out = ['This is', 'a great', 'language']
Does anyone know an easy way to do this?
There's probably a faster way to do this, but this works, at least:
word_lst = ['This','is','a','great','programming','language']
subs_lst= ['This is', 'language', 'a great']
substr_lst = [' '.join(word_lst[i:j]) for i in range(len(word_lst)) for j in range(i+1, len(word_lst)+1)]
sorted_subs_list = sorted(subs_lst, key=lambda x:substr_lst.index(x))
print sorted_subs_list
Output:
['This is', 'a great', 'language']
The idea is to build a list of every substring in word_lst, ordered so that all the entries that start with "This" come first, followed by all the entries starting with "is", etc. We store that in substr_lst.
>>> print substr_lst
['This', 'This is', 'This is a', 'This is a great', 'This is a great programming', 'This is a great programming language', 'is', 'is a', 'is a great', 'is a great programming', 'is a great programming language', 'a', 'a great', 'a great programming', 'a great programming language', 'great', 'great programming', 'great programming language', 'programming', 'programming language', 'language']
Once we have that list, we sort subs_list, using the index of each entry in substr_list as the key to sort by:
>>> substr_lst.index("This is")
1
>>> substr_lst.index("language")
20
>>> substr_lst.index("a great")
12
The intermediate step seems unneeded to me. Why not just make the word list a single string and find the substrings in that?
sorted(subs_lst, key = lambda x : ' '.join(word_lst).index(x))
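One small refinement: join the word list once instead of re-joining it for every key lookup. Note that str.index works on raw character positions, so this sketch assumes each sub-string occurs only as a proper word sequence:
joined = ' '.join(word_lst)
sorted_subs_lst = sorted(subs_lst, key=joined.index)
print(sorted_subs_lst)   # ['This is', 'a great', 'language']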
