I have used this code to apply lemmatization to my training df. I have 3 columns: label, article_title and article_text. I have already cleaned the latter two (lowercased, removed punctuation and removed stopwords). This code, however, leaves some words unchanged: for example, examining stays examining instead of becoming examine.
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]

train['article_title_lemma'] = train['article_title_stopwords'].apply(lemmatize_text)
train['article_text_lemma'] = train['article_text_stopwords'].apply(lemmatize_text)
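For reference, a minimal check outside the pipeline shows the likely cause: lemmatize defaults to the noun POS unless told otherwise (sketch assumes the WordNet data has already been downloaded via nltk):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('examining'))           # 'examining' (treated as a noun by default)
print(lemmatizer.lemmatize('examining', pos='v'))  # 'examine'   (treated as a verb)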
Hi stackoverflow community!
Long-time reader but first-time poster. I'm currently trying my hand at NLP and, after reading a few forum posts touching on this topic, I can't seem to get the lemmatizer to work properly (function pasted below). Comparing my original text with the preprocessed text, all the cleaning steps work as expected except the lemmatization. I've even tried specifying the part of speech 'v', so that words aren't treated as nouns by default and I still get the base form of each verb (ex: turned -> turn, are -> be, reading -> read), but this doesn't seem to be working.
Appreciate another set of eyes and feedback - thanks!
# key imports
import nltk
import pandas as pd
import numpy as np
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions
# cleaning functions
def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()

def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation)

def remove_stopwords(text):
    '''
    Removes stop words which don't have meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']
    stop_words = set(stopwords.words('english')) - set(['not', 'out', 'in'])
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in nltk.word_tokenize(text) if w not in stop_words])

def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)

# preprocessing pipeline
def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)
    return word_list
# lemmatize not properly working to get base form of word
# ex: 'turned' still remains 'turned' without returning base form 'turn'
# ex: 'running' still remains 'running' without getting base form 'run'
sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)
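For comparison, a minimal check outside the pipeline (not the original code, just an illustration): lemmatize looks up a single word, so passing it a whole cleaned sentence returns the string essentially unchanged, whereas lemmatizing token by token does produce base forms:

wnl = WordNetLemmatizer()
sentence = "turned and kept reading"
print(wnl.lemmatize(sentence, 'v'))   # whole string comes back unchanged
print([wnl.lemmatize(w, 'v') for w in word_tokenize(sentence)])
# ['turn', 'and', 'keep', 'read']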
In my understanding, what a tokenizer does is, given each word, break it down into sub-words only if the word is not present in tokenizer.get_vocab():
from transformers import AutoTokenizer

def checkModel(model):
    tokenizer = AutoTokenizer.from_pretrained(model)
    allList = []
    for word in tokenizer.get_vocab():
        word = word.lower()
        tokens = tokenizer.tokenize(word)
        try:
            if word[0] != '#' and word[0] != '[' and tokens[0] != word:
                allList.append((word, tokens))
                print(word, tokens)
        except:
            continue
    return allList

checkModel('bert-base-uncased')
# ideally should return an empty list
However, what I have observed is that some models on huggingface will break down words into smaller pieces even if the word is present in the vocab.
checkModel('emilyalsentzer/Bio_ClinicalBERT')
output:
welles ['well', '##es']
lexington ['le', '##xing', '##ton']
palestinian ['pale', '##st', '##inian']
...
elisabeth ['el', '##isa', '##beth']
alexander ['ale', '##xa', '##nder']
appalachian ['app', '##ala', '##chia', '##n']
mitchell ['mit', '##chel', '##l']
...
# in total, 4630 tokens in the vocab got broken down, which is not supposed to happen
I have checked a few models for this behaviour and was wondering: why is this happening?
This is a really interesting question, and I am currently wondering whether it should be reported as a bug on the Huggingface repo.
EDIT: I realized that it is possible to define model-specific tokenizer_config.json files to overwrite the default behavior. One example is the bert-base-cased repository, which has the following content in its tokenizer config:
{
"do_lower_case": false
}
Given that this functionality is available, I think the best option would be to contact the original author of the work and ask them to potentially consider this configuration (if appropriate for the general use case).
Original Answer:
As it turns out, the vocabulary word that you are checking for is welles, yet the vocab file itself only contains Welles. Notice the difference in the uppercased first letter?
It turns out you can manually force the tokenizer to specifically check for cased vocabulary words, in which case it works fine.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
do_lower_case=False) # This is different
print(tokenizer.do_lower_case)
# Output: False
# Lowercase input will result in a split word
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']

# Uppercase input will correctly *not split* the word
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'Welles', '[SEP]']
By default, however, this is not the case: all words will be converted to lowercase, which is why you cannot find the word:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# By default, lowercasing is enabled!
print(tokenizer.do_lower_case)
# Output: True

# This time we get the same (lowercased) output both times!
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']
The tokenizer you are calling 'emilyalsentzer/Bio_ClinicalBERT' has tokens that are not present in the original base tokenizer. To add tokens to the tokenizer one can either provide a list of strings or a list of tokenizers.AddedTokens.
The default behavior in both cases is to allow new words to be used as subwords. In my example, if we add 'director' and 'cto' to the tokenizer, then 'director' can be broken down into 'dire' + 'cto' + 'r' ('dire' and 'r' are part of the original tokenizer). To avoid this, one should use:
tokenizer.add_tokens([tokenizers.AddedToken(new_word, single_word=True) for new_word in new_words])
I do think a lot of users would simply use a list of strings (as I did, until half an hour ago). But this would lead to the problem that you saw.
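A quick way to see this in action (an illustrative sketch; the exact pieces depend on the base vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tok.add_tokens(["cto"])            # plain string: single_word defaults to False
print(tok.tokenize("director"))    # 'cto' can now be matched inside 'director', e.g. ['dire', 'cto', 'r']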
To change this for a customized tokenizer (like 'emilyalsentzer/Bio_ClinicalBERT') without losing much in model performance, I'd recommend extracting the set of words from this tokenizer and comparing it to its base tokenizer (for example 'bert-base-uncased'). This will give you the set of words that were added to the base tokenizer as part of model re-training. Then take the base tokenizer, add these new words to it using AddedToken with single_word set to True, and replace the custom tokenizer with this new tokenizer.
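A rough sketch of that procedure, assuming 'bert-base-uncased' really is the right base and that carrying over the extra vocabulary is all you need:

from transformers import AutoTokenizer
from tokenizers import AddedToken

base = AutoTokenizer.from_pretrained("bert-base-uncased")
custom = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# words present in the custom tokenizer's vocab but not in the base one
new_words = set(custom.get_vocab()) - set(base.get_vocab())

# re-add them so they only match as whole words, never as subword pieces
base.add_tokens([AddedToken(w, single_word=True) for w in new_words])

# 'base' now plays the role of the custom tokenizer; if you add tokens this way,
# remember to call model.resize_token_embeddings(len(base)) on the model as well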
As part of pre-processing for a text classification model, I have added stopword removal and lemmatization steps, using the NLTK library. The code is below:
import re
import pandas as pd
import nltk; nltk.download("all")
from nltk.corpus import stopwords; stop = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# Stopwords removal
def remove_stopwords(entry):
    sentence_list = [word for word in entry.split() if word not in stopwords.words("english")]
    return " ".join(sentence_list)

df["Description_no_stopwords"] = df.loc[:, "Description"].apply(lambda x: remove_stopwords(x))

# Lemmatization
lemmatizer = WordNetLemmatizer()

def punct_strip(string):
    s = re.sub(r'[^\w\s]', ' ', string)
    return s

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_rows(entry):
    sentence_list = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punct_strip(entry).split()]
    return " ".join(sentence_list)

df["Description - lemmatized"] = df.loc[:, "Description_no_stopwords"].apply(lambda x: lemmatize_rows(x))
The problem is that, when I pre-process a dataset with 27k entries (my test set), it takes 40-45 seconds for stopwords removal and just as long for lemmatization. By contrast, model evaluation only takes 2-3 seconds.
How can I re-write the functions to optimise computation speed? I have read something about vectorization, but the example functions were much simpler than the ones that I have reported, and I wouldn't know how to do it in this case.
A similar question was asked here and suggests that you try caching the stopwords.words("english") object. In your method remove_stopwords you are creating the object every time you evaluate an entry. So, you can definitely improve that. Regarding your lemmatizer, as mentioned here, you can also cache your results to improve performance. I can imagine that your pandas operations are also quite expensive. You may consider converting your dataframe into an array or dictionary and then iterating over it. If you need a dataframe later, you can easily convert it back.
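For example, a minimal sketch of both suggestions, building the stopword set once and memoizing the per-word lemmatization with functools.lru_cache, reusing the get_wordnet_pos and punct_strip helpers from the question:

from functools import lru_cache

stop_words = set(stopwords.words("english"))   # built once, not once per row

lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=None)
def lemmatize_cached(word):
    # repeated words across the corpus are POS-tagged and lemmatized only once
    return lemmatizer.lemmatize(word, get_wordnet_pos(word))

def remove_stopwords_fast(entry):
    return " ".join(w for w in entry.split() if w not in stop_words)

def lemmatize_rows_fast(entry):
    return " ".join(lemmatize_cached(w) for w in punct_strip(entry).split())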
I am using sklearn's TfidfVectorizer to vectorize my corpus. In my analysis, there are some documents in which all terms are filtered out because they contain only stopwords. To reduce the sparsity issue, and because it is meaningless to include them in the analysis, I would like to remove them.
Looking at the TfidfVectorizer docs, there is no parameter that can be set to do this. Therefore, I am thinking of removing these documents manually before passing the corpus to the vectorizer. However, this has a potential issue: the stopword list I have is not the same as the one effectively used by the vectorizer, since I also use the min_df and max_df options to filter out terms.
Is there a better way to achieve what I am looking for (i.e. removing/ignoring documents containing only stopwords)?
Any help would be greatly appreciated.
You can specify your stopwords and then, after TfidfVectorizer, filter out the empty rows.
The following code snippet shows a simplified example that should set you in the right direction:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aa ab", "aa ab ac"]
stop_words = ["aa", "ab"]

tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)

# rows that sum to zero contained only stopwords; drop them
idx = np.array(corpus_tfidf.sum(axis=1) == 0).ravel()
corpus_filtered = corpus_tfidf[~idx]
Feel free to ask questions if you still have any!
So, you can use this:
import re
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters or digits (e.g., raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)
    return ' '.join(filtered_tokens)

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.01, stop_words='english',
                                   use_idf=True, tokenizer=tokenize)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

# drop documents whose rows sum to zero (everything was filtered out)
ids = np.array(tfidf_matrix.sum(axis=1) == 0).ravel()
tfidf_filtered = tfidf_matrix[~ids]
This way you can remove stopwords, empty rows and use min_df and max_df.
My goal is to create a system that can take any random text, extract sentences, remove punctuation, and then, on a bare sentence (one of them), randomly replace NN- or VB-tagged words with their meronym, holonym or synonym, as well as with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use the pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want, but when I try to extract, for example, a hyponym from this djidja variable, it proves to be impossible since it is a Synset object and I can't manipulate it in any way.
Any idea how to extract the actual word that is reported in the hyponyms list? (e.g. print(djidja[2]) displays Synset(u'bowler'), so how do I extract only 'bowler' from this?)
Recall that a synset is just a list of words marked as synonyms. Given a synset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0] # a word can belong to many synsets, let's just use one for the sake of argument
print(s.synonyms)
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernyms and hyponyms:
print(s.hypernyms())
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
print(s.hypernyms()[0].synonyms)
Out[17]: [u'canine', u'canid']
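Applied to the variable from the question, the same attribute gives you the bare word (a short sketch, assuming djidja is the hyponym list built above):

print(djidja[2])              # Synset(u'bowler')
print(djidja[2].synonyms)     # the plain words in that synset, e.g. [u'bowler', ...]
print(djidja[2].synonyms[0])  # u'bowler'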