Reading the documentation for text feature extraction in scikit-learn, I am not sure how the different arguments available for TfidfVectorizer (and maybe other vectorizers) affect the outcome.
Here are the arguments I am not sure about:
TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)
The documentation is clear on the use of stop_words/max_df (both have a similar effect, and maybe one can be used instead of the other). However, I am not sure whether these options should be used together with ngrams. Which is handled first, ngrams or stop_words, and why? Based on my experiment, stop words are removed first, but the purpose of ngrams is to extract phrases, so I am not sure what effect this sequence has (stop words removed, then ngrams generated).
Second, does it make sense to use the max_df/min_df arguments together with the use_idf argument? Aren't their purposes similar?
I see several questions in this post.
How do the different arguments in TfidfVectorizer interact with one another?
You really have to use it quite a bit to develop a sense of intuition (that has been my experience, anyway).
TfidfVectorizer is a bag-of-words approach. In NLP, sequences of words and the window around them matter, and this kind of representation destroys some of that context.
How do I control what tokens get outputted?
Set ngram_range to (1, 1) to output only one-word tokens, (1, 2) for one-word and two-word tokens, (2, 3) for two-word and three-word tokens, and so on.
ngram_range works hand in hand with analyzer. Set analyzer to "word" to output words and phrases, or set it to "char" to output character ngrams.
If you want your output to have both "word" and "char" features, use sklearn's FeatureUnion; a sketch follows below.
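A minimal sketch of the FeatureUnion idea (the corpus and the transformer names here are made up for illustration, not taken from the linked example):

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["what is tfidf", "tfidf is a weighting scheme"]  # made-up documents

combined = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
])

X = combined.fit_transform(corpus)  # sparse matrix with word and char features side by side
print(X.shape)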
How do I remove unwanted stuff?
Use stop_words to remove less-meaningful English words.
The list of stop words that sklearn uses can be imported with:
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
(in newer scikit-learn versions the import is from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS)
The logic of removing stop words is that these words don't carry much meaning, yet they appear a lot in most text (a small sketch of how such a frequency list can be produced follows below the list):
[('the', 79808),
('of', 40024),
('and', 38311),
('to', 28765),
('in', 22020),
('a', 21124),
('that', 12512),
('he', 12401),
('was', 11410),
('it', 10681),
('his', 10034),
('is', 9773),
('with', 9739),
('as', 8064),
('i', 7679),
('had', 7383),
('for', 6938),
('at', 6789),
('by', 6735),
('on', 6639)]
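A minimal sketch of how a frequency list like the one above can be produced (the counts above come from the answerer's own corpus; some_corpus.txt is just a placeholder):

import re
from collections import Counter

raw_text = open("some_corpus.txt").read().lower()   # placeholder corpus file
tokens = re.findall(r"\b\w+\b", raw_text)
print(Counter(tokens).most_common(20))               # top 20 words by count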
Since stop words generally have a high frequency, it might make sense to use max_df as a float, say 0.95, to remove the top 5% of terms by document frequency. But then you're assuming that the top 5% are all stop words, which might not be the case. It really depends on your text data. In my line of work, it's very common that the top words or phrases are NOT stop words, because I work with dense text (search query data) on very specific topics.
Use min_df as an integer to remove rarely occurring words. If they only occur once or twice, they won't add much value and are usually really obscure. Furthermore, there are generally a lot of them, so ignoring them with, say, min_df=5 can greatly reduce your memory consumption and data size.
How do I include stuff that's being stripped out?
token_pattern defaults to the regex \b\w\w+\b, which means a token has to be at least 2 word characters long, so single-character words like "I" and "a" are removed, as are the single digits 0-9. You'll also notice that apostrophes are dropped, because \w does not match them.
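A quick, hedged way to see that behaviour (the example sentence is made up):

from sklearn.feature_extraction.text import CountVectorizer

sentence = "I don't like a 7 pm meeting"

# default token_pattern r"(?u)\b\w\w+\b": only tokens of 2+ word characters survive
print(CountVectorizer().build_analyzer()(sentence))
# -> ['don', 'like', 'pm', 'meeting']

# a custom pattern that keeps single characters and apostrophes
print(CountVectorizer(token_pattern=r"\b\w[\w']*\b").build_analyzer()(sentence))
# -> ['i', "don't", 'like', 'a', '7', 'pm', 'meeting']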
What happens first, ngram generation or stop word removal?
Let's do a little test.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS  # sklearn.feature_extraction.text in newer versions

docs = np.array(['what is tfidf',
                 'what does tfidf stand for',
                 'what is tfidf and what does it stand for',
                 'tfidf is what',
                 "why don't I use tfidf",
                 '1 in 10 people use tfidf'])

tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
matrix = tfidf.fit_transform(docs).toarray()
df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0

for doc in docs:
    print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))
This prints out:
tfidf
does tfidf stand
tfidf does stand
tfidf
don't I use tfidf
1 10 people use tfidf
Now let's print df:
10 and does don for in is \
what is tfidf 0.0 0.0 0.0 0.0 0.0 0.0 1.0
what does tfidf stand for 0.0 0.0 1.0 0.0 1.0 0.0 0.0
what is tfidf and what does it stand for 0.0 1.0 1.0 0.0 1.0 0.0 1.0
tfidf is what 0.0 0.0 0.0 0.0 0.0 0.0 1.0
why don't I use tfidf 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1 in 10 people use tfidf 1.0 0.0 0.0 0.0 0.0 1.0 0.0
it people stand tfidf use \
what is tfidf 0.0 0.0 0.0 1.0 0.0
what does tfidf stand for 0.0 0.0 1.0 1.0 0.0
what is tfidf and what does it stand for 1.0 0.0 1.0 1.0 0.0
tfidf is what 0.0 0.0 0.0 1.0 0.0
why don't I use tfidf 0.0 0.0 0.0 1.0 1.0
1 in 10 people use tfidf 0.0 1.0 0.0 1.0 1.0
what why
what is tfidf 1.0 0.0
what does tfidf stand for 1.0 0.0
what is tfidf and what does it stand for 2.0 0.0
tfidf is what 1.0 0.0
why don't I use tfidf 0.0 1.0
1 in 10 people use tfidf 0.0 0.0
Notes:
use_idf=False, norm=None: with these settings the vectorizer is equivalent to sklearn's CountVectorizer; it just returns counts.
Notice that the word "don't" was converted to "don". This is where you'd change token_pattern to something like token_pattern=r"\b\w[\w']+\b" to include apostrophes.
We see a lot of stop words.
Let's remove the stop words, switch to ngram_range=(1, 2), and look at df again:
tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))
Outputs:
10 10 people does does stand \
what is tfidf 0.0 0.0 0.0 0.0
what does tfidf stand for 0.0 0.0 1.0 0.0
what is tfidf and what does it stand for 0.0 0.0 1.0 1.0
tfidf is what 0.0 0.0 0.0 0.0
why don't I use tfidf 0.0 0.0 0.0 0.0
1 in 10 people use tfidf 1.0 1.0 0.0 0.0
does tfidf don don use people \
what is tfidf 0.0 0.0 0.0 0.0
what does tfidf stand for 1.0 0.0 0.0 0.0
what is tfidf and what does it stand for 0.0 0.0 0.0 0.0
tfidf is what 0.0 0.0 0.0 0.0
why don't I use tfidf 0.0 1.0 1.0 0.0
1 in 10 people use tfidf 0.0 0.0 0.0 1.0
people use stand tfidf \
what is tfidf 0.0 0.0 1.0
what does tfidf stand for 0.0 1.0 1.0
what is tfidf and what does it stand for 0.0 1.0 1.0
tfidf is what 0.0 0.0 1.0
why don't I use tfidf 0.0 0.0 1.0
1 in 10 people use tfidf 1.0 0.0 1.0
tfidf does tfidf stand use \
what is tfidf 0.0 0.0 0.0
what does tfidf stand for 0.0 1.0 0.0
what is tfidf and what does it stand for 1.0 0.0 0.0
tfidf is what 0.0 0.0 0.0
why don't I use tfidf 0.0 0.0 1.0
1 in 10 people use tfidf 0.0 0.0 1.0
use tfidf
what is tfidf 0.0
what does tfidf stand for 0.0
what is tfidf and what does it stand for 0.0
tfidf is what 0.0
why don't I use tfidf 1.0
1 in 10 people use tfidf 1.0
Take-aways:
the token "don use" happened because don't I use had the 't stripped off and because I was less than two characters, it was removed so the words were joined to don use... which actually wasn't the structure and could potentially change the structure a bit!
Answer: stop words are removed, short characters are removed, then ngrams are generated which can return unexpected results.
Does it make sense to use the max_df/min_df arguments together with the use_idf argument?
In my opinion, the whole point of term frequency-inverse document frequency is to re-weight the highly frequent words (the ones that would appear at the top of a sorted frequency list). This re-weighting takes the highest-frequency ngrams and moves them down the list to a lower position. Therefore, it's supposed to handle the max_df scenario.
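To make that re-weighting concrete, here is a small hedged sketch of scikit-learn's default (smooth_idf=True) idf factor; the document counts are made up:

import numpy as np

n_docs = 1000
for df_t in (1, 10, 100, 1000):                      # document frequency of a term
    idf = np.log((1 + n_docs) / (1 + df_t)) + 1      # idf(t) = ln((1+n)/(1+df(t))) + 1
    print(f"df={df_t:>4}  idf={idf:.2f}")            # the more common the term, the smaller its weight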
Maybe it's more of a personal choice whether you want to move them down the list ("re-weight"/de-prioritize them) or remove them completely.
I use min_df a lot, and it makes sense to use min_df if you're working with a huge dataset, because rare words won't add value and will just cause a lot of processing issues. I don't use max_df much, but I'm sure there are scenarios, when working with data like all of Wikipedia, where removing the top x% makes sense. A small sketch of the pruning effect follows.
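A minimal sketch of that pruning effect (the corpus below is made up; on real data the reduction is far more dramatic):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "tfidf weights terms by frequency",
    "rare terms get pruned by min_df",
    "very common terms get pruned by max_df",
    "terms terms terms everywhere",
]

plain = TfidfVectorizer().fit(corpus)
pruned = TfidfVectorizer(min_df=2, max_df=0.75).fit(corpus)   # drop very rare and very common terms

print(len(plain.vocabulary_), "terms without pruning")
print(len(pruned.vocabulary_), "terms with min_df=2, max_df=0.75")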
On the ordering question: in scikit-learn, tokens are first extracted according to your tokenizer/token_pattern, stop words are then filtered out of that token list, and only afterwards are the ngrams built from whatever remains (which is why bigrams such as "don use" can join words that were not adjacent in the original text). The stop word list itself contains unigrams only, so removing the stop words yourself before vectorizing gives the same kind of result: they won't be included in the bigrams either way.
Using min_df may in fact counter the effect of tf-idf: a word that appears maybe twice, in only one document, will get a high score there (remember, scores are per document), yet min_df would drop it. It depends on the application of your system (information retrieval / text categorization). If the threshold is low, it shouldn't affect text classification much, but retrieval might be biased (what if I want to find documents containing "Spain" and it only appears once, in one document, in the entire collection?). max_df overlaps with what use_idf does, as you said, but removing a word from the vocabulary altogether can have a stronger impact than merely weighting it low. Again, it depends on what you plan to do with the weights.
Hope this helps.
Related
I'm using TfidfVectorizer to extract features from my samples, which are all texts. However, my samples contain many URLs, and as a result http and https become important features. This also causes inaccurate predictions later with my Naive Bayes model.
The features I got are as follows. As you can see, https has high values.
good got great happy http https
0 0.18031992253877868 0.056537832999741425 0.0 0.13494772859235538 0.0 0.7206169458767526
1 0.062052081178508904 0.0 0.03348108448960768 0.03482887785597041 0.0 0.8266008657388199
2 0.066100442981558 0.0 0.03566543577965484 0.03710116101033473 0.0 0.9685823681046619
3 0.030596521808766947 0.028779865519712563 0.0 0.0 0.0 0.9781890670696571
4 0.0 0.03803344358481952 0.0 0.0 0.0 0.9964607105785932
5 0.0 0.0 0.0 0.07716693868942119 0.0 0.938602085540054
6 0.17689804723173405 0.033278959234969596 0.07635828939724364 0.15886424082427333 0.0 0.8718951596544265
7 0.0 0.0 0.02288252957804802 0.0 0.0 0.9603936784408945
8 0.08544543470034431 0.3214885842670747 0.09220660336028486 0.09591841408082484 0.0 0.39837897672993183
9 0.09492740119653752 0.02976370819366948 0.06829257573052833 0.0 0.0 0.9273261812039216
10 0.06892455146463301 0.0648321836892671 0.1859461187415361 0.0 0.0 0.8492883859345594
11 0.06407942255789043 0.02009157746015972 0.13829986166195216 0.023977862240478147 0.0 0.938967971292072
12 0.0 0.06353009389659953 0.03644231525495783 0.0 0.0 0.8772167495025313
13 0.0 0.0 0.044113599370101265 0.030592939021541497 0.0 0.34488252084969045
Could anyone help me get rid of these terms when I extract keywords with TF-IDF?
This is the vectorizer I initialized:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words='english', analyzer='word', max_features=50)
You can pass a list of stop words to TfidfVectorizer:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=['http', 'https'], analyzer='word', max_features=50)
These words will be ignored when vectorizing the texts (note that passing your own list replaces the built-in 'english' list).
And you can add your words to the default list like this:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
my_stop_words = text.ENGLISH_STOP_WORDS.union(['http', 'https'])
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=my_stop_words, analyzer='word', max_features=50)
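A hedged usage sketch (the two documents are made up) to confirm that http/https no longer appear among the features:

docs = ["great product https://example.com", "good service http://example.org"]
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # on older scikit-learn use get_feature_names()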
I'm trying to use the minimum values of each column to replace missing values but keep getting an error. Below is my code:
from sklearn.impute import SimpleImputer

numeric_cols = [X_test.select_dtypes(exclude=['object']).columns]
numeric_df = X_test.select_dtypes(exclude=['object'])

for col in numeric_cols:
    my_imputer = SimpleImputer(strategy='constant', fill_value=X_test[col].min())
    imputed_numeric_X_test = pd.DataFrame(my_imputer.fit_transform(numeric_df))
    imputed_numeric_X_test.columns = numeric_df.columns
This is the error I get when I run it:
ValueError: 'fill_value'=MSSubClass 20.0
LotFrontage 21.0
LotArea 1470.0
OverallQual 1.0
OverallCond 1.0
YearBuilt 1879.0
YearRemodAdd 1950.0
MasVnrArea 0.0
BsmtFinSF1 0.0
BsmtFinSF2 0.0
BsmtUnfSF 0.0
TotalBsmtSF 0.0
1stFlrSF 407.0
2ndFlrSF 0.0
LowQualFinSF 0.0
GrLivArea 407.0
BsmtFullBath 0.0
BsmtHalfBath 0.0
FullBath 0.0
HalfBath 0.0
BedroomAbvGr 0.0
KitchenAbvGr 0.0
TotRmsAbvGrd 3.0
Fireplaces 0.0
GarageYrBlt 1895.0
GarageCars 0.0
GarageArea 0.0
WoodDeckSF 0.0
OpenPorchSF 0.0
EnclosedPorch 0.0
3SsnPorch 0.0
ScreenPorch 0.0
PoolArea 0.0
MiscVal 0.0
MoSold 1.0
YrSold 2006.0
dtype: float64 is invalid. Expected a numerical value when imputing numerical data
What is wrong and how can I fix it?
SimpleImputer only supports a single value for fill_value, not a per-column specification. Adding that was discussed in Issue19783 but passed on, and it wouldn't have supported taking the column-wise minimum anyway. I can't find any discussion of adding a custom callable option for strategy, which would seem to be the cleanest solution. So I think you're stuck doing it manually or with a custom transformer. To do it somewhat manually, you could use the ColumnTransformer approach described in the linked issue, or the pandas sketch below.
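For example, the manual route can be as simple as letting pandas broadcast each column's minimum (a hedged sketch reusing the question's X_test):

import pandas as pd

numeric_df = X_test.select_dtypes(exclude=['object'])
# fillna with a Series fills column-wise, so every column gets its own minimum
imputed_numeric_X_test = numeric_df.fillna(numeric_df.min())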
I'm preprocessing my data before implementing a machine learning model. Some of the features have high cardinality, like country and language.
Since encoding those features as one-hot vectors can produce sparse data, I've decided to look into the hashing trick and used Python's category_encoders like so:
from category_encoders.hashing import HashingEncoder
ce_hash = HashingEncoder(cols = ['country'])
encoded = ce_hash.fit_transform(df.country)
encoded['country'] = df.country
encoded.head()
When looking at the result, I can see the collisions
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 country
0 0 0 1 0 0 0 0 0 US <━┓
1 0 1 0 0 0 0 0 0 CA. ┃ US and SE collides
2 0 0 1 0 0 0 0 0 SE <━┛
3 0 0 0 0 0 0 1 0 JP
Further investigation led me to this Kaggle article. The hashing example there includes both X and y.
What is the purpose of y? Does it help to fight the collision problem?
Should I add more columns to the encoder and encode more than one feature together (for example country and language)?
I would appreciate an explanation of how to encode such categories using the hashing trick.
Update:
Based on the comments I got from #CoMartel, I've looked at sklearn's FeatureHasher and written the following code to hash the country column:
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10,input_type='string')
f = h.transform(df.country)
df1 = pd.DataFrame(f.toarray())
df1['country'] = df.country
df1.head()
And got the following output:
0 1 2 3 4 5 6 7 8 9 country
0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
1 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
2 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 US
3 0.0 -1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CA
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 SE
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 JP
6 -1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AU
7 -1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 AU
8 -1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 DK
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 SE
Is that the way to use the library in order to encode high-cardinality categorical values?
Why are some values negative?
How would you choose the "right" n_features value?
How can I check the collisions ratio?
Is that the way to use the library in order to encode high-cardinality categorical values?
Yes, that is essentially how the library is used. One hedged caveat, based on FeatureHasher's documented input format: with input_type='string', each sample is expected to be an iterable of strings, so passing a bare string such as "US" makes the hasher hash its individual characters rather than the whole country name (which is why your rows have more than one non-zero entry). Wrapping each value in a list, e.g. h.transform([[c] for c in df.country]), hashes each country as a single feature.
You can think of the hashing trick as a "reduced-size one-hot encoding with a small risk of collision, which you don't need if you can tolerate the original feature dimension".
This idea was first introduced by Kilian Weinberger et al. You can find the whole theoretical and empirical analysis of the algorithm in their paper.
Why are some values negative?
To reduce the impact of collisions, a signed hash function is used. That is, each string is first hashed with the usual kind of hash function (e.g. the string is converted to a number, which is then taken modulo n_features to get an index in [0, n_features)). Then a second, single-bit hash function is applied; it produces +1 or -1 by definition, and that sign determines whether the value at the resulting index is incremented or decremented.
Pseudo code (it looks like Python, though):
def hash_trick(features, n_features):
    res = np.zeros(n_features)                 # output vector, one slot per hash index
    for f in features:
        h = usual_hash_function(f)             # just the usual hashing
        index = h % n_features                 # the modulo gives the slot for f in res
        if single_bit_hash_function(f) == 1:   # second, sign-producing hash to reduce collision impact
            res[index] += 1
        else:
            res[index] -= 1                    # <--- this is why some values become negative
    return res
How would you choose the "right" n_features value?
As a rule of thumb, and as you can guess, if we hash more distinct values than there are slots (i.e. more than n_features of them), collisions are certain to happen (pigeonhole principle). Hence the best-case scenario is that each feature maps to a unique hash value. In that case, logically speaking, n_features should be at least equal to the actual number of features/categories (in your particular case, the number of different countries). Nevertheless, remember that this is only the "best" case, which is not what happens "mathematically speaking". So the higher the better, of course, but how high? See next.
How can I check the collisions ratio?
If we ignore the second, single-bit hash function, the problem reduces to the "birthday problem" for hashing.
This is a big topic. For a comprehensive introduction to the problem, I recommend you read this, and for some detailed math, I recommend this answer.
In a nutshell, what you need to know is that when you hash k distinct items into k^2 buckets, the probability of no collision is about exp(-1/2) = 60.65%, which means there is approximately a 39.35% chance of at least one collision.
So, as a rule of thumb: if we have X countries, there is about a 40% chance of at least one collision if the hash output range (the n_features parameter) is X^2. In other words, there is a ~40% chance of a collision when the number of countries in your example equals square_root(n_features). In the low-probability regime, each further doubling of n_features roughly halves the chance of a collision. (Personally, if it is not for security purposes but just a plain conversion from strings to numbers, it is not worth going too high.) An empirical check is sketched below.
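An empirical check (a hedged sketch reusing the question's df; note that each country is wrapped in a list so the whole string is hashed as one feature):

import numpy as np
from sklearn.feature_extraction import FeatureHasher

countries = df['country'].astype(str)
h = FeatureHasher(n_features=10, input_type='string')
hashed = h.transform([[c] for c in countries]).toarray()

n_inputs = countries.nunique()                 # distinct countries
n_outputs = len(np.unique(hashed, axis=0))     # distinct hashed vectors
print(f"{n_inputs} countries -> {n_outputs} distinct hash vectors "
      f"(collision ratio ~ {1 - n_outputs / n_inputs:.2%})")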
Side note for curious readers: for a large enough hash output size (e.g. 256 bits), the chance that an attacker can find (or exploit) a collision is practically negligible (from a security perspective).
Regarding the y parameter: as you've already learned in a comment, it is there only for API compatibility and is not used (scikit-learn follows this convention in many other estimators as well).
The x_train looks like this (22 features):
total_amount reward difficulty duration discount bogo mobile social web income ... male other_gender age_under25 age_25_to_35 age_35_to_45 age_45_to_55 age_55_to_65 age_65_to_75 age_75_to_85 age_85_to_105
0 0.006311 0.2 0.50 1.000000 1.0 0.0 1.0 1.0 1.0 0.355556 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.015595 0.2 0.50 1.000000 1.0 0.0 1.0 1.0 1.0 0.977778 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
The label is 0 or 1; it's a binary classification problem. Here's the code, and I was following this page to implement SHAP:
# use SHAP
deep_explainer = shap.DeepExplainer(nn_model_2, x_train[:100])
# explain the first 10 predictions
# explaining each prediction requires 2 * background dataset size runs
shap_values = deep_explainer.shap_values(x_train)
This gave me error:
KeyError: 0
During handling of the above exception, another exception occurred
I have no idea what this message is complaining about. I tried to use SHAP with an XGBoost model and a logistic regression model and both work fine. I'm new to Keras and SHAP; can someone take a look and tell me how I can solve it? Many thanks.
I think SHAP is expecting a NumPy array, and indexing a pandas DataFrame as if it were a NumPy array is what raises the error. Try passing the underlying array:
shap_values = deep_explainer.shap_values(x_train.values)
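If the error persists, the background data handed to DeepExplainer may need to be a plain array as well; a hedged sketch reusing the question's names:

# pass NumPy arrays both for the background sample and for the rows being explained
deep_explainer = shap.DeepExplainer(nn_model_2, x_train.values[:100])
shap_values = deep_explainer.shap_values(x_train.values)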
My data contains a lot of categorical features, for example age, color, size, race, gender and so on.
The problem is that in scikit-learn we cannot mark features as factors as in R, so we have to convert the categorical data into dummy columns, e.g.
color size
green M
red L
blue XL
convert to
color_blue color_green color_red size_L size_M size_XL
0.0 1.0 0.0 0.0 1.0 0.0
0.0 0.0 1.0 1.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 1.0
However, I would like to rank the features as color or size, not color_blue or size_M.
Is there any possible way to do it? Or can I sum up the ranking scores of the related dummy columns
(so that the score for the color column would be the sum of the green, blue and red scores)?
Note that I use ExtraTreesClassifier for the ranking score calculation.
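One way to get the per-original-feature scores you describe is to sum the importances of the dummy columns that came from the same source column. A hedged sketch, assuming X_dummies is the dummy-encoded frame (e.g. from pd.get_dummies), y is the target, and the dummy names follow the column_value pattern shown above:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_dummies, y)

importances = pd.Series(clf.feature_importances_, index=X_dummies.columns)
# map each dummy column back to its source column (the part before the first '_';
# adjust this if your original column names themselves contain underscores)
grouped = importances.groupby(lambda name: name.split('_')[0]).sum()
print(grouped.sort_values(ascending=False))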