Weird lines when using TfidfVectorizer, possibly caused by replacing with ' '? - python

I posted on here yesterday about building a linear regression model on text for predicting sentiment. What I'm wondering is: after lowercasing the text and removing stopwords, punctuation and numbers, I'm left with weird lines in some of my text features.
['_______',
'__________',
'__________ pros',
'____________',
'____________ pros',
'_____________',
'_____________ pros',
'aa',
'aa waist',
'ab',
'abdomen',
'ability',
'able',
'able button',
'able buy',
I'm thinking perhaps it's because, for punctuation and numbers, I replaced them with a space? I'm still not absolutely sure.
Another question is how do I structure this properly for linear regression? Should I represent each sentence by a column of its features and feed that into the model? And how would I handle the matrix being sparse?
Sorry, I'm just learning more about text preprocessing.
Here are my cleaning steps. Let's assume a sentence looks like this: 'this dress in a lovely platinum is feminine and fits perfectly, easy to wear and comfy, too! highly recommend!'
1. lowercase
AllSentences['Sentence'] = AllSentences['Sentence'].map(lambda x: x.lower())
2. remove stop words
stop = stopwords.words('english')
AllSentences['Sentences_without_stopwords'] = AllSentences['Sentence'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
3. remove punctuation and numbers
AllSentences['Sentences_without_stopwords_punc'] = AllSentences['Sentences_without_stopwords'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
AllSentences['Sentences_without_stopwords_punc'] = AllSentences['Sentences_without_stopwords_punc'].apply(lambda x: re.sub(r'\d+', '', x))
4. train/test split, TfidfVectorizer
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,
                                                    random_state=42)
vect_word = TfidfVectorizer(max_features=20000, lowercase=True, analyzer='word',
                            stop_words='english', ngram_range=(1, 3),
                            dtype=np.float32)
tr_vect = vect_word.fit_transform(X_train)
ts_vect = vect_word.transform(X_test)
which gave me the feature names listed above.

I think TfidfVectorizer is a great place to start for an initial attempt at sentiment analysis. To avoid sparsity in your feature vectors, you may want to start with fewer features and work your way up depending on how your models perform. You could also make max_features a hyperparameter and use GridSearch with a Pipeline to find the best value for it. See an example of that here. Depending on how this goes, a more robust implementation might use word embeddings; however, that would most likely add complexity to your model.
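For illustration, here is a minimal sketch of what that could look like; the Ridge regressor, the parameter grid, and the variable names are my own assumptions, not taken from your code or the linked example:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# max_features becomes a tunable hyperparameter of the whole text pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
    ('reg', Ridge()),
])
param_grid = {'tfidf__max_features': [1000, 5000, 10000, 20000]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)          # X_train holds the raw sentences here
print(search.best_params_)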
The weird lines in the feature names are underscore characters that must have been in the source text. They were not removed by your cleanup because re.sub(r'[^\w\s]', '', x) only strips characters that are neither word characters (\w) nor whitespace (\s). The underscore is part of the word character set, so it survives.
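You can see this on a made-up string:
import re

# punctuation is stripped, but the underscores (and digits) survive [^\w\s]
re.sub(r'[^\w\s]', '', 'great___fit!! 10/10 would buy')
# -> 'great___fit 1010 would buy'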
I should also point out that most of your custom cleaning should not be necessary, as TfidfVectorizer can handle it for you. For example, you remove stop words, and then TfidfVectorizer removes them again. The same goes for removing punctuation and numbers from the string. TfidfVectorizer takes a token_pattern parameter: you can pass it a regular expression defining which characters make up a token. If you only want alphabetic characters in your tokens, a token_pattern of '[a-zA-Z]+' should handle the cleaning for you. Again, I avoid the \w character class here, since it includes underscores (and digits).
Since you have already run fit_transform on your train set and transform on your test set, the samples in those sets are ready for training/testing and need no further processing.
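So a leaner version of your step 4, with the cleaning pushed into the vectorizer itself, might look like this; X_train_raw and X_test_raw stand for the raw, uncleaned sentence columns (those names are mine):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vect_word = TfidfVectorizer(max_features=20000, lowercase=True, analyzer='word',
                            stop_words='english', ngram_range=(1, 3),
                            token_pattern='[a-zA-Z]+',   # alphabetic tokens only: no digits, no underscores
                            dtype=np.float32)
tr_vect = vect_word.fit_transform(X_train_raw)
ts_vect = vect_word.transform(X_test_raw)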

Related

Preprocessing a list of list removing stopwords for doc2vec using map without losing words order

I am implementing a simple doc2vec model with gensim (not word2vec).
I need to remove stopwords without losing the correct order within a list of lists.
Each list is a document and, as I understand it, for doc2vec the model takes a list of TaggedDocuments as input:
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
dataset = [['We should remove the stopwords from this example'],
           ['Otherwise the algo'],
           ["will not work correctly"],
           ['dont forget Gensim doc2vec takes list_of_list']]
STOPWORDS = ['we','i','will','the','this','from']
def word_filter(lst):
    lower = [word.lower() for word in lst]
    lst_ftred = [word for word in lower if word not in STOPWORDS]
    return lst_ftred
lst_lst_filtered= list(map(word_filter,dataset))
print(lst_lst_filtered)
Output:
[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
Expected Output:
[[' should remove the stopwords example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
What was my mistake, and how do I fix it?
Are there other efficient ways to solve this issue without losing the proper order?
List of questions I examined before asking:
How to apply a function to each sublist of a list in python?
I studied this and tried to apply it to my specific case.
Removing stopwords from list of lists
The order is important, so I can't use set.
Removing stopwords from a list of text files
This could be a possible solution; it is similar to what I have implemented.
I understood the difference, but I don't know how to deal with it.
In my case the document is not tokenized (and should not be tokenized, because it is doc2vec, not word2vec).
How to remove stop words using nltk or python
In this question the OP is dealing with a list, not a list of lists.
lower is a list with a single element (the whole sentence), so word is the entire sentence and word in STOPWORDS is always False, meaning nothing gets filtered. Take the first item of the list by index and split it on whitespace:
lst_ftred = ' '.join([word for word in lower[0].split() if word not in STOPWORDS])
# output: ['should remove stopwords example', 'otherwise algo', 'not work correctly', 'dont forget gensim doc2vec takes list_of_list']
# 'the' is also in STOPWORDS
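Putting that together, the corrected filter used with map, as in your original code, would look roughly like this:
def word_filter(lst):
    # lst is a one-element list holding the whole sentence
    words = lst[0].lower().split()
    return ' '.join(word for word in words if word not in STOPWORDS)

lst_lst_filtered = list(map(word_filter, dataset))
# ['should remove stopwords example', 'otherwise algo',
#  'not work correctly', 'dont forget gensim doc2vec takes list_of_list']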
First, note that it's not that important to remove stopwords for Doc2Vec training. Second, note that such tiny toy datasets won't deliver interesting results from Doc2Vec. The algorithm, like Word2Vec, only starts to show its value when trained on large datasets with (1) many, many more unique words than the number of vector dimensions; and (2) lots of varied examples of the usage of each word - at least a few, ideally dozens or hundreds.
Still, if you wanted to strip stopwords, it'd be easiest to do it after tokenizing the raw strings. (That is, after splitting the strings into lists-of-words. That's the format Doc2Vec will need anyway.) And you don't want your dataset to be a list-of-lists-with-one-string-each. Instead, you want it to be first a list-of-strings, then a list-of-lists-with-many-tokens-each.
The following should work:
string_dataset = [
    'We should remove the stopwords from this example',
    'Otherwise the algo',
    "will not work correctly",
    'dont forget Gensim doc2vec takes list_of_list',
]

STOPWORDS = ['we', 'i', 'will', 'the', 'this', 'from']

# Python list comprehension to break each string into tokens
tokenized_dataset = [s.split() for s in string_dataset]

def filter_words(tokens):
    """Lowercase each token, and keep it only if it is not in STOPWORDS."""
    return [token.lower() for token in tokens if token.lower() not in STOPWORDS]

filtered_dataset = [filter_words(tokens) for tokens in tokenized_dataset]
Finally, because, as noted above, Doc2Vec needs multiple usage examples of each word to work well, it's almost always a bad idea to use min_count=1.
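For completeness, a minimal sketch of feeding the filtered token lists into Doc2Vec via TaggedDocument (toy-sized parameters, just to show the shape of the data; on real data you'd use a larger vector_size and a min_count well above 1):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_docs = [TaggedDocument(words=tokens, tags=[i])
               for i, tokens in enumerate(filtered_dataset)]
model = Doc2Vec(tagged_docs, vector_size=5, window=2, min_count=2, workers=4)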

How to specify word vector for OOV terms in Spacy?

I have a pre-trained word2vec model that I load into spaCy to vectorize new words. Given new text, I call nlp('hi').vector to obtain the vector for the word 'hi'.
Eventually a new word needs to be vectorized that is not present in the vocabulary of my pre-trained model. In this scenario spaCy defaults to a vector filled with zeros. I would like to be able to set this default vector for OOV terms.
Example:
import spacy
path_model = '/home/bionlp/spacy.bio_word2vec.model'
nlp = spacy.load(path_model)
print(nlp('abcdef').vector, '\n', nlp('gene').vector)
This code outputs a dense vector for the word 'gene' and a vector full of 0s for the word 'abcdef' (since it's not present in the vocabulary).
My goal is to be able to specify the vector for missing words, so instead of getting a vector full of 0s for the word 'abcdef' you can get (for instance) a vector full of 1s.
If you simply want your plug-vector instead of the SpaCy default all-zeros vector, you could just add an extra step where you replace any all-zeros vectors with yours. For example:
words = ['words', 'may', 'by', 'fehlt']
my_oov_vec = ... # whatever you like
spacy_vecs = [nlp(word).vector for word in words]
fixed_vecs = [vec if vec.any() else my_oov_vec
              for vec in spacy_vecs]
I'm not sure why you'd want to do this. Lots of work with word-vectors simply elides out-of-vocabulary words; using any plug value, including SpaCy's zero-vector, may just be adding unhelpful noise.
And if better handling of OOV words is important, note that some other word-vector models, like FastText, can synthesize better-than-nothing guess-vectors for OOV words, by using vectors learned for subword fragments during training. That's similar to how people can often work out the gist of a word from familiar word-roots.
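For example, gensim's FastText implementation can be sketched like this (a toy corpus and placeholder parameters, just to show the OOV behaviour):
from gensim.models import FastText

sentences = [['gene', 'expression', 'profiles'],
             ['protein', 'gene', 'sequence']]          # toy corpus
ft = FastText(sentences, vector_size=100, window=3, min_count=1, epochs=10)
vec = ft.wv['genes']   # OOV surface form still gets a subword-based guess vector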

Sklearn - feature extraction from text - normalize text features by merging plural and singular forms

I am doing some text classification right now using sklearn.
As a first step I obviously need to use a vectorizer - either CountVectorizer or TfidfVectorizer. The issue I want to tackle is that my documents often contain both singular and plural forms of the same word. When performing vectorization I want to 'merge' the singular and plural forms and treat them as the same text feature.
Obviously I can manually pre-process the texts and just replace all plural word forms with singular forms when I know which words have this issue. But maybe there is a way to do it in a more automated fashion, where words that are extremely similar to each other are merged into the same feature?
UPDATE.
Based on the answer provided below, what I needed was stemming. Here is sample code which stems all words in the 'review' column of a dataframe df, which I then use for vectorization and classification. Just in case anyone finds it useful.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
df['review_token']=df['review'].apply(lambda x : filter(None,x.split(" ")))
df['review_stemmed']=df['review_token'].apply(lambda x : [stemmer.stem(y) for y in x])
df['review_stemmed_sentence']=df['review_stemmed'].apply(lambda x : " ".join(x))
I think what you need is stemming, namely stripping word endings so that words with a common root map to the same stem; it's one of the basic operations in preprocessing text data.
Here's some rules for stemming and lemmatization explained: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
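If you'd rather not build the extra dataframe columns, the stemming can also be folded straight into the vectorizer via a custom tokenizer. A minimal sketch along those lines (df['review'] as in the update above):
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

def stemming_tokenizer(text):
    # split on whitespace and stem each token, so 'dress'/'dresses' collapse to one feature
    return [stemmer.stem(token) for token in text.split()]

vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer, lowercase=True)
X = vectorizer.fit_transform(df['review'])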

Regex for unicode sentences without spaces using CountVectorizer

I hope I don't have to provide an example set.
I have a 2D array where each array contains a set of words from sentences.
I am using a CountVectorizer to effectively call fit_transform on the whole 2D array, such that I can build a vocabulary of words.
However, I have sentences like:
u'Besides EU nations , Switzerland also made a high contribution at Rs 171 million LOCATION_SLOT~-nn+nations~-prep_besides nations~-prep_besides+made~prep_at made~prep_at+rs~num rs~num+NUMBER_SLOT'
And my current vectorizer is too strict, removing things like ~ and + and splitting tokens on them. I want it to treat each whitespace-split() chunk as a token in the vocab, i.e. rs~num+NUMBER_SLOT should be a word in the vocab in its own right, as should made. At the same time, normal stopwords like the and a should be removed.
Current vectorizer:
vectorizer = CountVectorizer(analyzer="word",stop_words=None,tokenizer=None,preprocessor=None,max_features=5000)
You can specify a token_pattern but I am not sure which one I could use to achieve my aims. Trying:
token_pattern="[^\s]*"
Leads to a vocabulary of:
{u'': 0, u'p~prep_to': 3764, u'de~dobj': 1107, u'wednesday': 4880, ...}
Which messes things up, as u'' is not something I want in my vocabulary.
What is the right token pattern for the type of vocabulary I want to build?
I have figured this out. The vectorizer was allowing 0 or more non-whitespace items - it should allow 1 or more. The correct CountVectorizer is:
CountVectorizer(analyzer="word", token_pattern=r"[\S]+", tokenizer=None,
                preprocessor=None, stop_words=None, max_features=5000)
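As a quick sanity check (sentences_2d stands for your 2D array of pre-split words; the name is mine):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", token_pattern=r"[\S]+",
                             tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=5000)
docs = [' '.join(words) for words in sentences_2d]   # re-join each pre-split sentence
X = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)   # each whitespace-delimited chunk (lowercased by default) is now its own term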

Add stop_words while performing TF-IDF cosine similarity

I'm using sklearn to perform cosine similarity.
Is there a way to consider all the words starting with a capital letter as stop words?
The following regex takes a string as input and removes all sequences of alphanumeric characters that begin with an uppercase letter, replacing them with the empty string. See http://docs.python.org/2.7/library/re.html for more options.
import re

s1 = "The cat Went to The store To get Some food doNotMatch"
r1 = re.compile(r'\b[A-Z]\w+')
r1.sub('', s1)
' cat to store get food doNotMatch'
Sklearn also has many great facilities for text feature generation, such as sklearn.feature_extraction.text Also you might want to consider NLTK to assist in sentence segmentation, removing stop words, etc...
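One way to wire this into the cosine-similarity workflow is to pass the regex as a custom preprocessor to the vectorizer. A minimal sketch with made-up documents (note that supplying preprocessor= replaces sklearn's default lowercasing, so the function lowercases explicitly):
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

capitalized = re.compile(r'\b[A-Z]\w+')

def drop_capitalized(text):
    # strip Capitalized words, then lowercase the remainder
    return capitalized.sub('', text).lower()

docs = ["The cat Went to The store", "the cat walked to a shop"]   # made-up examples
vec = TfidfVectorizer(preprocessor=drop_capitalized)
tfidf = vec.fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))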
