Python: Creating Term Document Matrix from list

Python: Creating Term Document Matrix from list - python

So I wanted to train a Naive Bayes Algorithm over some documents and the below code would just run fine if I had documents in the form of strings. But the issues is the strings I have goes through a series of pre-processing step which is more then stopword remove, lemmatization etc rather there are some custom conversion which returns a list of ngrams, where n can [1,2,3] depending on the context of text.
So now since I have list of ngram instead of a string representing a document I am confused how can I represent the same as an input to CountVectorizer.
Any suggestions?
Code that would work fine with docs as a document array of type string.
count_vectorizer = CountVectorizer(binary='true')
data = count_vectorizer.fit_transform(docs)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data,op)

You should combine all your pre-processing steps into preprocessor and maybe tokenizer functions, see section 4.2.3.10 and CountVectorizer description from scikit-learn docs. For example of such tokenizers/transformers see related question of src code of scikit-learn itself.

Related

Word2Vec Vocab Similarities

I ran a word2vec algo on text of about 750k words (before removing some stop words). Using my model, I started looking at the most similar words to particular words of my choosing, and the similarity scores (for model.wv.most_similar method) are all super close to 1. The tenth closest score is still like .998, so I feel like I'm not getting any significant differences between the similarity of words which leads to meaningless similar words.
My constructor for the model is
model = Word2Vec(all_words, size=75, min_count=30, window=10, sg=1)
I think the problem may lie in how I structure the text to run the neural net on. I store all the words like so:
all_sentences = nltk.sent_tokenize(v)
all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
all_words = [[word for word in all_words[0] if word not in nltk.stopwords('English')]]
...where v is the result of calling read() on a txt file.

Have you looked at all_words, just before passing it to Word2Vec, to make sure it contains the size and variety of corpus you expected? (That last stop-word stripping step looks like it'll only operate on the very 1st sentence, all_words[0].)
Also, have you enabled logging at the INFO level, and watched the output for indicators of the model's final vocabulary size & training progress, to check if those values are as expected?
Note that removing stopwords isn't strictly necessary for word2vec training. Their presence doesn't hurt much, and the default frequent-word downsampling, controlled by the sample parameter, already serves to often-ignore very-frequent words like stopwords.
(Also, min_count=30 is fairly aggressive for a smallish corpus.)

Based on my knowledge, I recommend the following:
Use sg=0 to use the continuous bag of word model instead of the skip-gram model. CBOW is better at smaller dataset. The skip-gram model was trained in the official paper over 1 billion words.
Use min_count=5 which is the one they used in the paper and they had 1 billion. I think 30 is way too much for your data.
Don't remove the stop words as it will change the neighboring words in the moving window.
Use more iterations like iter=10 for example.
Use gensim.utils.simple_preprocess instead of word_tokenize as the punctuation isn't helpful in this case.
Also, I recommend split your dataset into paragraphs instead of sentences, but I don't know if this is applicable in your dataset or not
When following these steps, your code should be:
>>> from gensim.utils import simple_preprocess
>>> all_sentences = nltk.sent_tokenize(v)
>>> all_words = [simple_preprocess(sent) for sent in all_sentences]
>>> # define the model
>>> model = Word2Vec(all_words, size=75, min_count=5, window=10, sg=0, iter=10)

Preprocessing textual data into integer indices (like the imdb data set in the tensorFlow text-classification example)

I've been following the TensorFlow text classification tutorial (https://www.tensorflow.org/tutorials/keras/basic_text_classification), classifying the IMDB reviews.
The IMDB data is a part of the keras distribution, and is downloaded pre-processed.
I would like to experiment with my own texts. Is there an efficient way to pre-process my own texts into the word->int representation? I have tried using dictionaries, tuples and sorting, but it is very inefficient. I have a feeling there is a more efficient way to do it.
I have scanned the nltk and the keras pre-processing tools, but may have overlooked something there.

For a simple conversion from text sequences to integer sequences, we can use the keras.preprocessing.text.Tokenizer module.
The Tokenizer assigns a index ( not zero ) to each word present in the corpus. Using this vocabulary, the texts are tokenized.
Suppose, texts is the list of sentences which you have. Then,
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( texts )
tokenized_messages = tokenizer.texts_to_sequences( texts )
padded_messages = keras.preprocessing.sequence.pad_sequences( tokenized_messages , maxlen )
Where maxlen is the maximum length to which the tokenized message will be padded ( mostly by adding zeros ).

Tensorflow: Tokenizing bi-grams and n-grams using the Tensorflow Datasets utilities

A number of text classification models and embedding models use uni-gram, bi-gram, and n-grams as tokens for analysis. I found a way to use the tfds.features.text.Tokenizer() to extract uni-grams or words from some textual data. However, I wanted to see if there is a way to use the Tokenizer to extract bi-grams or n-grams from the text? I checked the documentation and did not see a setting for the size of each n-gram, but perhaps I missed something.
The code to extract n-grams comes from one of the tutorials on the Tensorflow website:
tokenizer = tfds.features.text.Tokenizer()
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
some_tokens = tokenizer.tokenize(text_tensor.numpy())
vocabulary_set.update(some_tokens)
vocab_size = len(vocabulary_set)
print(f'Vocabulary size is: {vocab_size}')

fast way to tokenize data in Python like TfidfVectorizer does

i used TfidfVectoriser to create tf-idf matrix
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer(smooth_idf=False,stop_words=stopwords.words('english'))
tfidf = transformer.fit_transform(raw_documents=sentences)
Now I want to transform each element of my sentences list to list of tokens, which were used in tfidfvectoriser. I tried to extract it directly from tfidf object via this function
def get_token(id, transformer_name, tfidf_obj):
return(np.array(transformer_name.get_feature_names())[\
tfidf_obj[id].toarray().reshape((tfidf_obj.shape[1],))!=0])
Where id is index of each sentence. In this function i tried to extract given row from tf-idf object, find non-zero elements and extract corresponding elements from transformer_name.get_feature_names() . Looks too complex =). Also this solution works very slow =/
Is there any way to get tokens using tfidfvectorizer's preprocessing and tokenization functions?

CountVectorizer matrix varies with new test data for classification?

I have created a model for text classification using python. I have CountVectorizer and it results in a document term matrix of 2034 rows and 4063 columns ( unique words ). I saved the model I used for new test data. My new test data
test_data = ['Love', 'python', 'every','time']
But the problem is I converted the above test data tokens into a feature vector, but it differs in shape. Because the model expect a 4063 vector. I know how to solve it by taking vocabulary of CountVectorizer and search for each token in test data and putting it in that index. But is there any easy way to handle this problem in scikit-learn itself.

You should not fit a new CountVectorizer on the test data, you should use the one you fit on the training data and call transfrom(test_data) on it.

You have two ways to solve this
1. you can use the same CountVectorizer that you used for your train features like this
cv = CountVectorizer(parameters desired)
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also creat another CountVectorizer, if you really want to(but not advisable since you would be wasting space and you'd still want to use the same parameters for your CV), and use the same feature.
cv_train = CountVectorizer(parameters desired)
X_train = cv_train.fit_transform(train_data)
cv_test = CountVectorizer(vocabulary=cv_train.get_feature_names(),desired params)
X_test = cv_test.fit_transform(test_data)

try to use:
test_features = inverse_transform(test_data)
this should return you what you wish for.

I added .toarray() to the wole command in order to see the results as a matrix.
so you should write:
X_test_analyst = Pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()

I'm mega late for this discussion, but I just want to leave something for people come from the search engine.
Sorry for my bad English.
;)
As mention by #Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set), u can imagine what count vectorizer do is make a 2d array(or think as a excel table), every column is a unique word, every row representing a document(or sentence), and the value (i,j) means in i^th sentence, the frequency of j^th word.
If you make a new CountVectorizer using your new data, the unique word probably(if not must) be different. When u make model.predict using this data, it will report some sort of error telling u the dim are not correct.
What I did in my code is the following:
If you train your model in different .py / .ipynb file, you can use import pickle followed by dump function for your fitted count vectorizer. You can follow the detail in this post.
If you train your model in same .py/.ipynb file, you can directly follow what #Andreas Mueller said.
code:
import pickle
pk.dump(vectorizer,open(r'/relative path','wb'))
pk.dump(pca,open(r'/relative path','wb'))
# ...
# When you want to use:
import pickle
vectoriser = pk.load(open(r'/relative path','rb'))
pea = pk.load(open(r'/relative path','rb'))
#...
Side note:
If I remember correctly, you can also export class or other things using pickle, but when you did so, make sure the class is already defined when you load the object. Not sure if this matters in this case, but I still import PCA and CountVectorizer before I did the pk.load function.
I'm just a beginner in coding so please test my code before use it in your project.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Creating Term Document Matrix from list - python

You should combine all your pre-processing steps into preprocessor and maybe tokenizer functions, see section 4.2.3.10 and CountVectorizer description from scikit-learn docs. For example of such tokenizers/transformers see related question of src code of scikit-learn itself.

Related

Word2Vec Vocab Similarities

Preprocessing textual data into integer indices (like the imdb data set in the tensorFlow text-classification example)

Tensorflow: Tokenizing bi-grams and n-grams using the Tensorflow Datasets utilities

fast way to tokenize data in Python like TfidfVectorizer does

CountVectorizer matrix varies with new test data for classification?

Categories

Resources