How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() will tokenize the strings with some pre-defined method. I want to observe how it tokenizes strings so that I can more easily tune my model.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
I want something like this:
>>>vectorizer.get_processed_tokens()
[['this', 'is', 'first', 'document'],
['this', 'document', 'is', 'second', 'document'],
['this', 'is', 'the', 'third', 'one'],
['is', 'this', 'the', 'first', 'document']]
How can I do this?
build_tokenizer() serves exactly this purpose. Try this:
tokenizer = lambda docs: [vectorizer.build_tokenizer()(doc) for doc in docs]
tokenizer(corpus)
Output:
[['This', 'is', 'the', 'first', 'document'],
['This', 'document', 'is', 'the', 'second', 'document'],
['And', 'this', 'is', 'the', 'third', 'one'],
['Is', 'this', 'the', 'first', 'document']]
A one-liner solution would be:
list(map(vectorizer.build_tokenizer(),corpus))
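Under the hood, build_tokenizer() simply applies the vectorizer's token_pattern, whose default (per the sklearn docs) is r"(?u)\b\w\w+\b": runs of two or more word characters, so punctuation and single-character tokens are dropped. You can reproduce it with a plain regex:
import re
default_pattern = re.compile(r"(?u)\b\w\w+\b")   # sklearn's documented default token_pattern
print(default_pattern.findall('This is the first document.'))
# ['This', 'is', 'the', 'first', 'document']
Note that build_tokenizer() only tokenizes; lowercasing happens in build_preprocessor(), which is why the output above keeps the original capitalisation.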
I'm not sure there's a built-in sklearn function that returns your output in exactly that format, but a fitted TfidfVectorizer instance has a vocabulary_ attribute: a dictionary mapping terms to feature indices (see the documentation). A combination of that and the output of the get_feature_names method should be able to do this for you. Hope it helps.
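For reference, for the corpus in the question those two pieces look roughly like this (term -> column index, and index -> term):
print(vectorizer.vocabulary_)
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']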
This is roughly the general idea (writing it from memory):
Y = X.toarray()
vocab = vectorizer.get_feature_names()
fake_corpus = []
for doc in Y:
    # keep the terms whose tf-idf weight is non-zero in this document
    l = [vocab[word_index] for word_index, value in enumerate(doc) if value > 0]
    fake_corpus.append(l)
With Y you have, for each doc in the corpus, the indices of the words it contains; with vocab you have the word a given index corresponds to, so you basically just need to combine them.
Related
I am interested in finding the same words in two lists. I have two lists of words in text_list; I have also stemmed the words.
text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
So I need this output:
same_words= ['a', 'sentence', 'interest']
You need to apply stemming to both lists, otherwise there are discrepancies: for example 'interesting' vs. 'interest', and if you apply stemming only to words_list then 'sentence' becomes 'sentenc' and no longer matches. So apply the stemmer to both lists and then find the common elements:
from nltk.stem import PorterStemmer
text_list = [['i', 'am', 'interest','for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']
ps = PorterStemmer()
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem,i)) for i in text_list]
answer = []
for i in text_list:
answer.append(list(set(words_list).intersection(set(i))))
output = sum(answer, [])
print(output)
>>> ['interest', 'a', 'sentenc']
There is a package called fuzzywuzzy which allows you to do approximate matching between the strings of one list and the strings of another.
First of all, you will need to flatten your nested list into a list/set of unique strings.
from itertools import chain
newset = set(chain(*text_list))
{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}
Next, import the fuzz module from the fuzzywuzzy package.
from fuzzywuzzy import fuzz
result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]
[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]
As the fuzzywuzzy documentation describes, fuzz.token_set_ratio matches every element from words_list against all the elements in newset and gives the percentage of matching characters between the two. You can remove the max to see the full list. (Some of the letters of 'for' also occur in 'word', which is why it shows up in this tuple list too, with a 57% match. You can later use a loop and a percentage tolerance to remove the matches below that tolerance, as sketched below.)
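For example, a quick sketch of that filtering step (the 70% tolerance here is an arbitrary choice):
threshold = 70   # arbitrary tolerance; tune it to your data
filtered = [(score, match) for score, match in result if score >= threshold]
# [(100, 'a'), (100, 'sentence'), (84, 'interest')]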
Finally, you will use map to get your desired output.
similarity_score, fuzzy_match = map(list,zip(*result))
fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']
Extra
If your input is not plain ASCII, you can pass an extra argument to fuzz.token_set_ratio:
a = ['У', 'вас', 'є', 'чашка', 'кави?']
b = ['ви']
[max([(fuzz.token_set_ratio(i, j, force_ascii= False),j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
Write a function construct_ngrams(sentence, n) which takes input parameters sentence (type string) and n (type integer), and returns a list containing the N-grams generated from the given sentence. If no such N-gram can be generated (think about the cases), it simply returns an empty list.
I have this so far
def construct_ngrams(sentence, n):
"""Returns a list that counts N-gram generated from the given sentence"""
words = sentence.split()
if n == 0 or n > len(words) -1:
return []
ngram = []
for i in range(n):
ngram.append(words[i:i+n])
return ngram
However, this does not pass the following test:
ngrams = construct_ngrams('this is another long sentence for testing', 6)
print(ngrams)
it gives:
[['this', 'is', 'another', 'long', 'sentence', 'for'], ['is', 'another', 'long', 'sentence', 'for', 'testing'], ['another', 'long', 'sentence', 'for', 'testing'], ['long', 'sentence', 'for', 'testing'], ['sentence', 'for', 'testing'], ['for', 'testing']]
rather than:
[['this', 'is', 'another', 'long', 'sentence', 'for'], ['is', 'another', 'long', 'sentence', 'for', 'testing']]
Is anyone able to help me fix this?
The main mistake in your code is the loop: range(n) starts windows only at indices 0..n-1, and the later windows come out shorter than n. You should instead iterate over every valid starting position, i.e. range(len(words) - n + 1). (I also relaxed the length check so that n == len(words) still yields the single possible n-gram.) Here is the revised code, hope it helps.
def construct_ngrams(sentence, n):
    words = sentence.split()
    ngram = []
    if n <= 0 or n > len(words):
        pass                                   # no n-gram can be generated
    else:
        for i in range(len(words) - n + 1):    # one window per valid starting index
            ngram.append(words[i:i + n])
    return ngram
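With the corrected loop range, the test case from the question now gives the expected result:
ngrams = construct_ngrams('this is another long sentence for testing', 6)
print(ngrams)
# [['this', 'is', 'another', 'long', 'sentence', 'for'], ['is', 'another', 'long', 'sentence', 'for', 'testing']]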
While making bigrams and trigrams, the code is somehow being executed in a way that each letter is being considered instead of each word. Please let me know how to fix this! (items is the name of the file.)
bigram_phrases = gensim.models.Phrases(items, min_count=5, threshold=50)
trigram_phrases = gensim.models.Phrases(bigram_phrases[items], threshold=50)
bigram= gensim.models.phrases.Phraser(bigram_phrases)
trigram= gensim.models.phrases.Phraser(trigram_phrases)
def make_bigrams(texts):
return ([bigram[doc] for doc in texts])
def make_trigram(texts):
return ([trigram[bigram[doc]] for doc in texts])
data_bigrams = make_bigrams(items)
data_bigrams_trigrams = make_trigram(data_bigrams)
print(data_bigrams_trigrams)
The output was displayed like this:
[['m','o','t','i','v',',','O','r','n','a' and so on
My guess is that your items variable contains an iterable of strings, not an iterable of lists of strings, which is what the gensim.models.phrases.Phrases class expects as the argument passed to its sentences parameter.
For example, your items variable might look like this:
items = ['this', 'is', 'a', 'sentence', 'this', 'one', 'is', 'different']
when it should look like:
items = [['this', 'is', 'a', 'sentence'], ['this', 'one', 'is', 'different']]
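If that is the case, a minimal sketch of the fix (assuming items currently holds one raw string per document) is to tokenize each document before building the phrase models, for example with gensim's simple_preprocess:
from gensim.utils import simple_preprocess
# turn each raw document string into a list of word tokens
items = [simple_preprocess(doc) for doc in items]
bigram_phrases = gensim.models.Phrases(items, min_count=5, threshold=50)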
Here is my code:
count = CountVectorizer(lowercase = False)
vocabulary = count.fit_transform([words])
print(count.get_feature_names())
For example if:
words = "Hello #friend, this is a good day. #good."
I want it to be separated into this:
['Hello', '#friend', 'this', 'is', 'a', 'good', 'day', '#good']
Currently, this is what it is separated into:
['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']
You can use the token_pattern parameter here from CountVectorizer as mentioned in the documentation:
Pass a regex to tell CountVectorizer what should be considered a word. Let's say in this case we tell CountVectorizer that even a word containing # should count as a word. Then do:
count = CountVectorizer(lowercase = False, token_pattern = '[a-zA-Z0-9$&+,:;=?##|<>.^*()%!-]+')
Output:
['#good', '#friend', 'Hello', 'a', 'day', 'good', 'is', 'this']
I am trying to use the CountVectorizer module from scikit-learn. From what I read, it seems like it can be used on a list of sentences, like:
['This is the first document.','This is the second second document.','And the third one.', 'Is this the first document?']
However, is there a way to vectorize a collection of words in list form, such as [['this', 'is', 'text', 'document', 'to', 'analyze'], ['and', 'this', 'is', 'the', 'second'], ['and', 'this', 'and', 'that', 'are', 'third']]?
I am trying to convert each list to a sentence using ' '.join(wordList), but I am getting an error:
TypeError: sequence item 13329: expected string or Unicode, generator found
when I try to run:
vectorizer = CountVectorizer(min_df=50)
ratings = vectorizer.fit_transform([' '.join(wordList)])
thanks!
I guess you need to join each inner list back into a single string and then fit on those documents:
docs = [' '.join(words) for words in wordList]   # one string per document
counts = vectorizer.fit_transform(docs)          # sparse matrix with columns corresponding to words
words = vectorizer.get_feature_names()           # list of words corresponding to the columns
Finally, to recover something like ['this', 'is', 'text', 'document', 'to', 'analyze'] for a given document:
sample_idx = 0
sample_words = [words[i] for i, count in enumerate(counts.toarray()[sample_idx]) if count > 0]
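Alternatively, if wordList really is a list of token lists, you can (as far as I know) skip tokenization altogether by giving CountVectorizer a callable analyzer, so each inner list is used as-is; a sketch under that assumption:
from sklearn.feature_extraction.text import CountVectorizer
docs = [['this', 'is', 'text', 'document', 'to', 'analyze'],
        ['and', 'this', 'is', 'the', 'second'],
        ['and', 'this', 'and', 'that', 'are', 'third']]
# with a callable analyzer each document is passed through unchanged,
# so pre-tokenized lists are counted directly
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())
# ['analyze', 'and', 'are', 'document', 'is', 'second', 'text', 'that', 'the', 'third', 'this', 'to']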