I have a list called dictionary1. I use the following code to get sparse count matrices of texts:
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None)
cv1.fit_transform(dictionary1)
I notice however that
list(set(dictionary1)-set(cv1.get_feature_names()))
results in ['i']. So "i" is in my dictionary but CountVectorizer ignores it (presumably some default setting discards one-char words). In the documentation I could not find such an option. Can someone point me to the problem? Indeed I would like to keep "i" in my analysis, as it could refer to more personal language.
A working workaround is passing the dictionary as the vocabulary directly (actually I don't know why I did not do that in the first place). I.e.
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=[], vocabulary=dictionary1)
cv1._validate_vocabulary()
list(set(dictionary1)-set(cv1.get_feature_names())) then returns [].
In my original post, I should have mentioned that dictionary1 already is a list of unique tokens.
The default configuration tokenizes the string by extracting only words of at least two characters, so single-character tokens such as "i" are dropped.
Check out this link to see more details about sklearn vectorizers.
In your case, you should swap in a different tokenizer rather than a different analyzer. For example, you can use TweetTokenizer from the nltk library:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TweetTokenizer
corpus = [...some_texts...]
tk = TweetTokenizer()
vectorizer = CountVectorizer(tokenizer=tk.tokenize)
x = vectorizer.fit_transform(corpus)
For example, if corpus is defined as below, you would get:
corpus = ['I love ragdolls',
          'I received a cat',
          'I take it as my best friend']
vectorizer.get_feature_names()
> ['a', 'as', 'best', 'cat', 'friend', 'i', 'it', 'love', 'my', 'ragdolls', 'received', 'take']
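Another option, if you want to keep scikit-learn's built-in analyzer, is to relax token_pattern, which is what drops tokens shorter than two characters. A minimal sketch (the exact pattern below is scikit-learn's documented default, but treat the snippet as a sketch rather than a drop-in fix):
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love ragdolls', 'I received a cat', 'I take it as my best friend']

# The default token_pattern is r"(?u)\b\w\w+\b" (two or more word characters);
# dropping one \w keeps single-character tokens such as "i".
cv1 = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cv1.fit_transform(corpus)
print(cv1.get_feature_names())  # 'i' now appears among the features
# (use get_feature_names_out() on newer scikit-learn versions)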
I use the PyThaiNLP package to tokenize my Thai-language data for sentiment analysis.
First, I build a function that adds a set of new words to the dictionary and tokenizes the text:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize

def text_tokenize(Mention):
    new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
    words = new_words.union(thai_words())
    custom_dictionary_trie = dict_trie(words)
    dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
    return dataa
After that, I apply it inside my text_process function, which also removes punctuation and stop words.
punctuations = '''.?!,;:-_[]()'/<>{}\##$&%~*ๆฯ'''

from pythainlp.corpus.common import thai_stopwords
from pythainlp import word_tokenize

def text_process(Mention):
    # strip punctuation ('ๆ' and 'ฯ' are already part of the punctuation string)
    final = "".join(u for u in Mention if u not in punctuations)
    final = text_tokenize(final)
    final = " ".join(word for word in final)
    # drop stop words
    final = " ".join(word for word in final.split() if word.lower() not in thai_stopwords)
    return final

dff['text_tokens'] = dff['Mention'].apply(text_process)
dff
The problem is that this function takes far too long to run: it had been running for 17 minutes and still had not finished. I tried replacing
final = text_tokenize(final) with final = word_tokenize(final)
and that took just 2 minutes, but I can no longer use it because I need to add my custom dictionary. I know there is something wrong, but I really don't know how to fix it.
I am new to Python and NLP, so please help.
PS: sorry for my broken English.
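One thing that may explain the difference (this is an assumption, not something confirmed in the thread): text_tokenize rebuilds the custom dictionary trie from all of thai_words() on every call, whereas plain word_tokenize does not. A minimal sketch of building the trie once and reusing it, with the same functions as in the question:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize

# Build the custom dictionary trie once, at module level, not on every call.
new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
custom_dictionary_trie = dict_trie(new_words.union(thai_words()))

def text_tokenize(mention):
    # Reuse the prebuilt trie for each row.
    return word_tokenize(mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)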
I am not familiar with the Thai language, but I assume that for tokenization you can also use language-agnostic tokenization tools.
If you want to perform word tokenization, try the example below:
from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)
>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']
Is there a way for Gensim to generate strictly the bigrams, trigrams in a list of words?
I can successfully generate the unigrams, bigrams, trigrams but I would like to extract only the bigrams, trigrams.
For example, in the list below:
words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],["i","love","new","york"],["new","york","is","great"]]
I use
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]
This creates a list of unigrams and bigrams as follows:
[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
['i', 'love', 'new_york'],
['new_york', 'is', 'great']]
My question is, is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?
It's not a built-in option of the gensim Phrases functionality.
If we can assume none of your original unigrams had the '_' character in them, a step to select only tokens with a '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:
words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]
(You could change the joining character if for some reason there were underscores in your unigrams, and you didn't want those confused with Phrases-combined bigrams.)
If none of that is good enough, you could potentially look at the code in gensim which actually scores & combines unigrams into bigrams...
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300
...and either extend that module with your extra needed option, or mimic its behavior outside the class in your own code.
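If you do go the "mimic its behavior outside the class" route, a rough, hand-rolled sketch could look like the following. It approximates gensim's default Mikolov-style score, (count(a,b) - min_count) * N / (count(a) * count(b)), with N taken to be the vocabulary size; it is an illustration of the idea, not the library's exact logic:
from collections import Counter

def bigrams_only(sentences, min_count=1, threshold=1.0):
    """Rough re-implementation of gensim's default bigram scoring that
    returns only the joined bigrams passing the threshold."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        unigram_counts.update(sent)
        bigram_counts.update(zip(sent, sent[1:]))
    len_vocab = len(unigram_counts)  # stand-in for gensim's vocabulary size
    kept = {}
    for (a, b), ab_count in bigram_counts.items():
        if ab_count < min_count:
            continue
        # Mikolov-style score: (count(a,b) - min_count) * N / (count(a) * count(b))
        score = (ab_count - min_count) * len_vocab / (unigram_counts[a] * unigram_counts[b])
        if score > threshold:
            kept["_".join((a, b))] = score
    return kept

words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],
         ['i', 'love', 'new', 'york'],
         ['new', 'york', 'is', 'great']]
print(bigrams_only(words))  # with these settings only 'new_york' should pass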
I've reviewed the official documentation of the text_to_word_sequence method in Keras here
The code listed in the documentation is:
keras.preprocessing.text.text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
I was wondering if there is a way to add something in the filters parameter that would also remove stopwords (such as from the nltk list of stopwords) i.e.
from nltk.corpus import stopwords
stopwords.words('english')
I am aware that we can also remove stopwords via regular expressions (using a _sre.SRE_Pattern) as below:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
phrase = pattern.sub('', phrase)
My minimum verifiable example is:
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
text_to_word_sequence("The cat is in the hat!!!")
Output: ['the', 'cat', 'is', 'in', 'the', 'hat']
I would like the output to be :
['cat', 'hat']
My question is this:
Is there a way using the filters parameter in the text_to_word_sequence method to automatically filter out stopwords along with the special characters that it filters out by default? Such as by using a pattern (_sre.SRE_Pattern) etc?
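For reference, here is a minimal workaround sketch that simply filters the token list after text_to_word_sequence, rather than through the filters parameter itself (an assumption-based approach, using the nltk stopword list mentioned above):
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import text_to_word_sequence

stop_words = set(stopwords.words('english'))

# Tokenize first, then drop any token found in the stopword set.
tokens = [w for w in text_to_word_sequence("The cat is in the hat!!!")
          if w not in stop_words]
print(tokens)  # ['cat', 'hat']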
I have a numpy array of sentences (strings)
arr = np.array(["It's the most wonderful time of the year.",
                "With the kids jingle belling.",
                "And everyone telling you be of good cheer.",
                "It's the hap-happiest season of all."])
(that I read from a csv file). I need to make a numpy array with all the unique words in these sentences.
So what I need is
array(["It's", "the", "most", "wonderful", "time", "of" "year", "With", "the", "kids", "jingle", "belling" "and", "everyone", "telling", "you", "be", "good", "cheer", "It's", "hap-happiest", "season", "all"])
I could do this like
o = []
for x in arr:
    o += x.split()
words = np.array(o)
unique_words = np.array(list(set(words.tolist())))
but as this involves first building lists and then converting them to a NumPy array, it's obviously going to be slow and inefficient for large data.
I also tried nltk as in
words = np.array([])
for x in arr:
    words = np.append(words, nltk.word_tokenize(x))
but this too seems inefficient, as a new array is created on each iteration instead of the old one being modified.
I suppose there's some elegant way of achieving what I want using more of numpy.
Can you point me in the right direction?
I think you can try something like this:
vocab = set()
for x in arr:
    vocab.update(nltk.word_tokenize(x))
set.update() takes an iterable to add elements to existing set.
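If, as in the question, a NumPy array is needed at the end, the set can be converted once after the loop; a minimal sketch reusing the names from the question:
import numpy as np
import nltk

vocab = set()
for x in arr:  # arr is the array of sentences from the question
    vocab.update(nltk.word_tokenize(x))

unique_words = np.array(sorted(vocab))  # sorted only to get a stable order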
Update:
Also, you can look at the working of CountVectorizer in scikit-learn which:
converts a collection of text documents to a matrix of token counts.
And it uses a dictionary to keep track of the unique words:
# raw_documents is an iterable of sentences.
for doc in raw_documents:
    feature_counter = {}
    # analyze will split the sentences into tokens
    # and apply some preprocessing on them (like stemming, lemma etc)
    for feature in analyze(doc):
        try:
            # vocabulary is a dictionary containing the words and their counts
            feature_idx = vocabulary[feature]
            ...
            ...
And I think it works pretty efficiently. So you could also use a dict() instead of a set. I am not familiar with the inner workings of NLTK, but I think it must also contain something equivalent to CountVectorizer.
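A minimal sketch of the dict-based variant described above (plain Python, not CountVectorizer itself), again reusing arr from the question:
import nltk
import numpy as np

token_counts = {}
for x in arr:  # arr is the array of sentences from the question
    for token in nltk.word_tokenize(x):
        token_counts[token] = token_counts.get(token, 0) + 1

unique_words = np.array(list(token_counts))  # the dict keys are the unique tokens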
I'm not sure numpy is the best way to go here. You can achieve what you want with nested lists and sets or dictionaries.
One useful thing to know is that the tokenizer methods from nltk can process a list of sentences, and will return a list of tokenized sentences. For example:
from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()
tokenized = wpt.tokenize_sents(arr)
This will return a list of lists of the tokenized sentences in arr, i.e.:
[['It', "'", 's', 'the', 'most', 'wonderful', 'time', 'of', 'the', 'year', '.'],
['With', 'the', 'kids', 'jingle', 'belling', '.'],
['And', 'everyone', 'telling', 'you', 'be', 'of', 'good', 'cheer', '.'],
['It', "'", 's', 'the', 'hap', '-', 'happiest', 'season', 'of', 'all', '.']]
nltk comes with lots of different tokenizers, and so will give you options for how best to split the sentences into word tokens. You can then use something like the following to get the unique set of words / tokens:
unique_words = set()
for toks in tokenized:
    unique_words.update(toks)
Short version:
If I have a stemmed word:
Say 'comput' for 'computing', or 'sugari' for 'sugary'
Is there a way to construct its closest noun form?
That is 'computer', or 'sugar' respectively
Longer version:
I'm using python and NLTK, Wordnet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most sem-sim scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so I may get an estimate of their similarity (instead of the 'NONE' that normally gets returned with adjectives).
I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?
You might want to look at this example:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize('having','v')
'have'
(from this SO answer) to see if it sends you in the right direction.
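Building on that direction, a hedged sketch (not part of the answer above) uses WordNet's derivationally related forms to jump from a verb or adjective lemma to candidate noun lemmas; which lemmas are printed depends on the installed WordNet data:
from nltk.corpus import wordnet as wn

# Derivationally related forms can link a verb/adjective lemma to noun lemmas.
for lemma in wn.lemmas('compute', pos='v'):
    for related in lemma.derivationally_related_forms():
        if related.synset().pos() == 'n':
            print(related.name())  # may print noun lemmas such as 'computation' or 'computer'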
First extract all the possible candidates from wordnet synsets.
Then use difflib to compare the strings against your target stem.
>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
>>> candidates = set(chain(*[ss.lemma_names() for ss in wn.all_synsets('n') if len([i for i in ss.lemma_names() if target in i]) > 0]))
>>> gcm(target,candidates)[0]
A more human-readable way to compute the candidates is as follows:
candidates = set()
for ss in wn.all_synsets('n'):
    for ln in ss.lemma_names():  # all possible lemma names for this synset
        if target in ln:
            candidates.add(ln)