Build vocab using spaCy - Python

I'm using the spaCy tokenizer to tokenize my data and then build a vocabulary.
This is my code:
import spacy

nlp = spacy.load("en_core_web_sm")

def build_vocab(docs, max_vocab=10000, min_freq=3):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    word_freq = {}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1
            if word_freq[word] == min_freq:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos
But it takes hours to complete since I have more than 800,000 sentences.
Is there a faster and better way to achieve this? Thanks.
Update: I tried removing min_freq:
def build_vocab(docs, max_vocab=10000):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in stoi:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos
It still takes a long time. Does spaCy have a function to build a vocab like torchtext's .build_vocab?

There are a couple of things you can do to make this faster.
import spacy
from collections import Counter

def build_vocab(texts, max_vocab=10000, min_freq=3):
    nlp = spacy.blank("en")  # just the tokenizer
    wc = Counter()
    for doc in nlp.pipe(texts):
        for word in doc:
            wc[word.lower_] += 1
    word2id = {}
    id2word = {}
    for word, count in wc.most_common():
        if count < min_freq:
            break
        if len(word2id) >= max_vocab:
            break
        wid = len(word2id)
        word2id[word] = wid
        id2word[wid] = word
    return word2id, id2word
Explanation:
If you only need the tokenizer, you can use spacy.blank instead of loading a full pipeline.
nlp.pipe is fast for large amounts of text (less important, and maybe irrelevant, with a blank model; see the usage sketch below).
Counter is optimized for exactly this kind of counting task.
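A quick usage sketch (hedged: the n_process and batch_size values are illustrative, not part of the original answer):

texts = ["This is the first sentence.", "Here is another one."]
word2id, id2word = build_vocab(texts, max_vocab=10000, min_freq=1)
print(word2id)

# On a very large corpus, the tokenization inside build_vocab could also be
# spread over processes, e.g.:
#     for doc in nlp.pipe(texts, n_process=4, batch_size=1000):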
Also note that the way you build the vocab in your original example, you take the first N words that reach min_freq, not the N most frequent words, which is probably not what you want.
Finally, if you're using spaCy you usually shouldn't build a vocab this way: spaCy has its own built-in Vocab class (with a StringStore) that handles converting tokens to IDs. You might still need a small contiguous mapping for a downstream task, but look at the vocab docs to see whether you can use that instead.
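A minimal sketch of that built-in mapping (hedged: these IDs are 64-bit hashes, not the small contiguous indices an embedding layer usually expects):

import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox")
for token in doc:
    h = nlp.vocab.strings[token.lower_]            # string -> hash ID
    print(token.lower_, h, nlp.vocab.strings[h])   # hash ID -> string round-trip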

Related

How to split input text into equal-size token chunks (not character length) and concatenate the summarization results with Hugging Face transformers

I am using the methodology below to summarize texts longer than the 1024-token limit.
The current method splits the text in half. I took this from another user's post and modified it slightly.
What I want to do instead is split the whole text into equal-sized chunks of up to 1024 tokens, summarize each of them, and then concatenate the summaries in the correct order and write the result to a file. How can I do this tokenization and get the correct output?
Splitting the text with split(" ") doesn't work the same way as tokenization; it produces a different token count.
import logging
from transformers import pipeline

f = open("TextFile1.txt", "r")
ARTICLE = f.read()

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

counter = 1

def summarize_text(text: str, max_len: int) -> str:
    global counter
    try:
        #logging.warning("max_len " + str(max_len))
        summary = summarizer(text, min_length=30, do_sample=False)
        with open('parsed_' + str(counter) + '.txt', 'w') as f:
            f.write(text)
        counter += 1
        return summary[0]["summary_text"]
    except IndexError as ex:
        logging.warning("Sequence length too large for model, cutting text in half and calling again")
        return summarize_text(text=text[:(len(text) // 2)], max_len=max_len) + " " + summarize_text(text=text[(len(text) // 2):], max_len=max_len)

gg = summarize_text(ARTICLE, 1024)

with open('summarized.txt', 'w') as f:
    f.write(gg)
I like splitting text using NLTK. You can also do it with spaCy, and the quality is better, but it takes a bit longer. Both NLTK and spaCy let you cut text into sentences, which is better because the resulting pieces are more coherent. You want to cut at less than 1024 tokens to be on the safe side; 512 should work well, and it's what the original BERT uses, so it shouldn't be too bad. At the end you just summarize the summaries. Here's an example:
import nltk
from nltk.tokenize import sent_tokenize

def split_in_segments(text):
    tokens = 0
    mystring = list()
    segments = []
    for sent in sent_tokenize(text):
        newtokens = len(sent.split())
        tokens += newtokens
        mystring.append(str(sent).strip())
        if tokens > 512:
            segments.append(" ".join(mystring))
            mystring = []
            tokens = 0
    if mystring:
        segments.append(" ".join(mystring))
    return segments

def summarize_4_plotly(text):
    segments = split_in_segments(text)
    summarylist = summarizer(segments, max_length=100, min_length=30, do_sample=False)
    summary = summarizer(" ".join([summarylist[i]['summary_text'] for i in range(len(summarylist))]), max_length=120, min_length=30, do_sample=False)
    return summary

summarize_4_plotly(text)
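Note that this snippet reuses summarizer and text without defining them. A minimal setup sketch that would make it self-contained (hedged; the file and model names are taken from the question above):

import nltk
from transformers import pipeline

nltk.download('punkt')  # sentence tokenizer data used by sent_tokenize

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

with open("TextFile1.txt", "r") as f:
    text = f.read()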

Python pandas: trying to make a word count

Hi, I just discovered the tweepy API. I can create a DataFrame with pandas from the tweets object fetched from tweepy, and I want to make a word-count DataFrame for my tweets. Here's my code:
freq_df = (
    hastag_tweets_df["Tweet"]
    .apply(lambda x: pd.value_counts(x.split(" ")))
    .sum(axis=0)
    .sort_values(ascending=False)
    .reset_index()
    .head(10)
)
freq_df.columns = ["Words", "freq"]

print('FREQ DF\n\n')
print(freq_df)
print('\n\n')

a = freq_df[freq_df.freq > freq_df.freq.mean() + freq_df.freq.std()]

# plotting
fig = a.plot.barh(x="Words", y="freq").get_figure()
This doesn't look the way I want, because it always starts with an empty string and stop words like "the":
       Words    freq
0               301.0
1      the      164.0
So how can I get the desired data, without the empty entry and words like 'the'?
Thank you
You can use the spaCy library to do this. With it you can easily remove words like "the" and "a", known as stop words.
It's easy to install: pip install spacy (you will also need the model: python -m spacy download en_core_web_sm).
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# process text using spaCy
def process_text(text):
    # tokenize text
    doc = nlp(text)
    token_list = [t.text for t in doc]
    # remove stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if lexeme.is_stop == False:
            filtered_sentence.append(word)
    # here we return the number of filtered words; you can return the list as well
    return len(filtered_sentence)

df = (
    df
    .assign(words_count=lambda d: d['comment'].apply(lambda c: process_text(c)))  # use your own column name, e.g. "Tweet"
)
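If what you actually want is the top-10 word frequencies with empty strings and stop words filtered out, here is an alternative sketch (hedged; it assumes the hastag_tweets_df["Tweet"] column from the question):

from collections import Counter

import pandas as pd
import spacy

nlp = spacy.blank("en")  # tokenizer only; stop-word and punctuation flags still work

wc = Counter()
for doc in nlp.pipe(hastag_tweets_df["Tweet"].astype(str)):
    for tok in doc:
        if not (tok.is_stop or tok.is_punct or tok.is_space):
            wc[tok.lower_] += 1

freq_df = pd.DataFrame(wc.most_common(10), columns=["Words", "freq"])
print(freq_df)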

Code removes stopwords, but Word2vec still creates word vectors for stopwords?

I have code that loads a file, strips each sentence, removes some stopwords and returns the tokens.
So far so good. If I include a print() statement or run a simple example, I can see that stopwords are removed, BUT
when I run the sentences through my Word2vec model, the model still creates a word vector for stopwords like 'the'. Is there an error in my code?
class Raw_Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for file in file_loads:  # list with the corresponding file names, e.g. 'Users/file1.txt'
            with open(file, 'r', buffering=20000000, encoding='utf-8') as t:
                for sentence in tokenizer.tokenize(t.read().replace('\n', ' ').lower()):
                    sent = remove_stopwords(sentence)
                    print(sent)
                    yield gensim.utils.simple_preprocess(sent, deacc=True)
Then I run:
sentences = Raw_Sentences(directory)

num_features = 200
min_word_count = 2
num_workers = cpu_count()
context_size = 4
downsampling = 1e-5
seed = 2

model = gensim.models.Word2Vec(sentences,
                               sg=1,  # skip-gram
                               seed=seed,
                               workers=num_workers,
                               size=num_features,
                               min_count=min_word_count,
                               window=context_size,
                               sample=downsampling)

model.most_similar('the')
and it returns similar words, but the word 'the' should have been removed...
For crying out loud: remove_stopwords is a gensim function (from gensim.parsing.preprocessing import remove_stopwords). It takes a set of stopwords (stoplist = set(stop_words)) and removes them:
def remove_stopwords(s):
    s = utils.to_unicode(s)
    return " ".join(w for w in s.split() if w not in stoplist)
Are you sure your corpus doesn't contain any instances of 'thé'? (If it did, that might not be removed by remove_stopwords(), but then when passed through simple_preprocess(..., deacc=True) the accent-removal would convert it to plain 'the'.)
Note also that lots of published Word2Vec work doesn't bother to remove stop words. The sample downsampling will already thin out the occurrences of any very-common words, without needing any fixed list of stop-words.
So even if your code is debugged, that entire stop-word-removal step may be an unnecessary source of complication & fragility in your code.
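If you want to check where 'the' is sneaking in, here is a small debugging sketch (hedged; it reuses Raw_Sentences, directory and model from the question):

from collections import Counter

counts = Counter(w for sent in Raw_Sentences(directory) for w in sent)
print(counts.get('the', 0))     # > 0 means 'the' survives preprocessing

# After training (gensim 3.x API, matching the question's size= argument):
print('the' in model.wv.vocab)  # gensim 4.x equivalent: 'the' in model.wv.key_to_index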

Usage of NLTK SentiWordNet with Python

I am doing sentiment analysis on Twitter data using Python NLTK. I need a dictionary that contains the positive and negative polarities of words. I have read a lot about SentiWordNet, but when I use it for my project it does not give efficient or fast results. I think I'm not using it correctly. Can anyone tell me the correct way to use it? Here are the steps I have done so far:
tokenization of tweets
POS tagging of tokens
passing each tags to sentinet
I am using the nltk package for tokenization and tagging. See a part of my code below:
import nltk
from nltk.stem import *
from nltk.corpus import sentiwordnet as swn

pscore = 0  # positive score (assumed initialized earlier in the full script)
nscore = 0  # negative score

tokens = nltk.word_tokenize(row)  # for tokenization; row is a line of the file in which tweets are saved
tagged = nltk.pos_tag(tokens)     # for POS tagging

for i in range(0, len(tagged)):
    if 'NN' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'n')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'n'))[0]).pos_score()  # positive score of a word
        nscore += (list(swn.senti_synsets(tagged[i][0], 'n'))[0]).neg_score()  # negative score of a word
    elif 'VB' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'v')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'v'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'v'))[0]).neg_score()
    elif 'JJ' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'a')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'a'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'a'))[0]).neg_score()
    elif 'RB' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'r')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'r'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'r'))[0]).neg_score()
At the end I will be calculating how many tweets are positive and how many tweets are negative.
Where am I wrong? How should I use it? And is there any other similar kind of dictionary which is easy to use?
Yes, there are other lexicons that you can use. You can find a small list of lexicons here: http://sentiment.christopherpotts.net/lexicons.html#resources
It seems that Bing Liu's Opinion Lexicon is quite easy to use.
Apart from linking to those lexicons, that website is also a very nice tutorial on sentiment analysis.
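For example, Bing Liu's lexicon also ships with NLTK as the opinion_lexicon corpus; a hedged sketch of how it could be used (the sample tweet is purely illustrative):

import nltk
from nltk.corpus import opinion_lexicon

nltk.download('opinion_lexicon')  # one-time downloads
nltk.download('punkt')

pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())

tokens = nltk.word_tokenize("the battery is terrible but the screen is great")
score = sum((t in pos_words) - (t in neg_words) for t in tokens)
print(score)  # > 0 leans positive, < 0 leans negative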
To calculate the sentiment of a document:
# tagged_words is assumed to be a list of (token, wordnet_pos) pairs,
# e.g. [('great', 'a'), ('camera', 'n')]; the block is wrapped in a function
# here so that the final return statement is valid Python
def classify_sentiment(tagged_words):
    totalScore = 0
    count_words_included = 0
    for word in tagged_words:
        synset_forms = list(swn.senti_synsets(word[0], word[1]))
        if not synset_forms:
            continue
        synset = synset_forms[0]
        totalScore = totalScore + synset.pos_score() - synset.neg_score()
        count_words_included = count_words_included + 1
    final_dec = ''
    if count_words_included == 0:
        final_dec = 'N/A'
    elif totalScore == 0:
        final_dec = 'Neu'
    elif totalScore / count_words_included < 0:
        final_dec = 'Neg'
    elif totalScore / count_words_included > 0:
        final_dec = 'Pos'
    return final_dec
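A hedged usage sketch (classify_sentiment is the name given to the cleaned-up block above, and penn_to_wn is a hypothetical helper mapping Penn Treebank tags to the WordNet tags that senti_synsets expects):

# requires nltk.download() of 'punkt', 'averaged_perceptron_tagger',
# 'wordnet' and 'sentiwordnet'
import nltk
from nltk.corpus import sentiwordnet as swn

def penn_to_wn(tag):
    # map Penn Treebank tags (NN*, VB*, JJ*, RB*) to WordNet POS letters
    if tag.startswith('NN'): return 'n'
    if tag.startswith('VB'): return 'v'
    if tag.startswith('JJ'): return 'a'
    if tag.startswith('RB'): return 'r'
    return None

tokens = nltk.word_tokenize("This phone has a great camera but a terrible battery")
tagged = [(w, penn_to_wn(t)) for w, t in nltk.pos_tag(tokens)]
tagged = [(w, t) for w, t in tagged if t is not None]
print(classify_sentiment(tagged))  # e.g. 'Pos', 'Neg' or 'Neu'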

n-grams with Naive Bayes classifier

I'm new to Python and need help!
I was practicing Python NLTK text classification.
Here is the code example I am practicing with:
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
I've tried this:
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file('positive.txt', 'rt') as f:
    for line in f.readlines():
        train_samples[line] = 'pos'

with file('negative.txt', 'rt') as d:
    for line in d.readlines():
        train_samples[line] = 'neg'

f = open("test.txt", "r")
test_samples = f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs

def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist

def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0], item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label, fname] = probdist
    return feature_probdist

labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
but I'm getting this error. Why?
Traceback (most recent call last):
  File "C:\python\naive_test.py", line 76, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\naive_test.py", line 23, in bigramReturner
    bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
A bigram feature vector follows exactly the same principles as a unigram feature vector. So, just like in the tutorial you mentioned, you will have to check whether a bigram feature is present in any of the documents you use.
As for the bigram features and how to extract them, I have written the code below for it. You can simply adapt it to change the variable "tweets" in the tutorial.
import nltk

text = "Hi, I want to get the bigram list of this string"

for item in nltk.bigrams(text.split()):
    print ' '.join(item)
Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this is helpful enough; otherwise, let me know if you still have problems.
Please note that in applications like sentiment analysis, some researchers tokenize the words and remove the punctuation, while others don't. From experience I know that if you don't remove punctuation, Naive Bayes works almost the same, but an SVM will have a decreased accuracy rate. You might need to play around with this and decide what works better on your dataset.
Edit 1:
There is a book named "Natural Language Processing with Python" which I can recommend to you. It contains examples of bigrams as well as some exercises. However, I think you can solve this case even without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence
"I drive a truck"
the word unigram features would be each of those 4 words, while the word bigram features would be:
["I drive", "drive a", "a truck"]
Now you want to use those 3 as your features. The function below puts all bigrams of a string in a list named bigramFeatureVector.
def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector
Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
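A minimal sketch of such a removePunctuation helper (hedged; one simple regex-based option that works in both Python 2 and 3):

import re

def removePunctuation(text):
    # strip anything that isn't a word character or whitespace
    return re.sub(r'[^\w\s]', '', text)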
