Keyword extraction with TF-IDF - python

I want to write a function that takes one element of my list and returns its 10 top keywords using TF-IDF. I have seen code examples but could not implement them. Each element of my list is a long sentence.
I have written these two functions, but I do not know how to do what I described above.
def fit(train_data):
    cleaned_lst = []
    for element in train_data:
        # removing customized stop words
        cleaned = remove(element)
        cleaned_lst.append(cleaned)
    for sentence in cleaned_lst:
        vectorizer = TfidfVectorizer(tokenizer=word_tokenize)
        fitted_data = vectorizer.fit([sentence])
    return fitted_data

def transfom(test_data):
    transformed_data = fit(train_data).transform([element for element in test_data])
    return transformed_data
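Not part of the original question, but a minimal sketch of one way to do what it asks: fit a single TfidfVectorizer on the whole cleaned list (rather than refitting per sentence), transform one element, and take the 10 highest-scoring terms. The remove() stop-word helper comes from the question; everything else (the function name, get_feature_names_out, which needs scikit-learn >= 1.0) is an assumption.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(documents, index, k=10):
    # documents: the list of long sentences from the question;
    # remove() is the asker's custom stop-word filter and is assumed to exist
    cleaned = [remove(d) for d in documents]
    vectorizer = TfidfVectorizer(tokenizer=word_tokenize)
    tfidf = vectorizer.fit_transform(cleaned)      # one matrix for the whole corpus
    row = tfidf[index].toarray().ravel()           # TF-IDF scores of one element
    terms = vectorizer.get_feature_names_out()     # get_feature_names() on older scikit-learn
    top = row.argsort()[::-1][:k]                  # positions of the k largest scores
    return list(terms[top])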


Is there a faster way to lookup dictionary indices?

I am trying to look up dictionary indices for thousands of strings and this process is very, very slow. There are package alternatives, like KeyedVectors from gensim.models, which does what I want to do in about a minute, but I want to do what the package does more manually and to have more control over what I am doing.
I have two objects: (1) a dictionary that contains key : value pairs for word embeddings, and (2) my pandas dataframe with my strings that need to be transformed into the index value found for each word in object (1). Consider the code below -- is there any obvious improvement to speed or am I relegated to external packages?
I would have thought that key lookups in a dictionary would be blazing fast.
Object 1
embeddings_dictionary = dict()
glove_file = open('glove.6B.200d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
Object 2 (The slowdown)
no_matches = []
glove_tokenized_data = []
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            # the line below is the problem
            idx = list(embeddings_dictionary.keys()).index(word)
        except:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
You've got a mapping of word -> np.array. It appears you want a quick way to map word to its location in the key list. You can do that with another dict.
no_matches = []
glove_tokenized_data = []
word_to_index = dict(zip(embeddings_dictionary.keys(), range(len(embeddings_dictionary))))
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            idx = word_to_index[word]
        except KeyError:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
In the line you marked as a problem, you first create a list from the keys and then look the word up in that list. You do this inside the loop, so the first thing you could do is move that logic to the top of the block (outside the loop) to avoid repeated work; second, the search is now happening on a list, not a dictionary.
Why not create another dictionary like this at the top of the file:
reverse_lookup = {word: index for index, word in enumerate(embeddings_dictionary.keys())}
and then use this dictionary to look up the index of your word. Something like this:
for word in doc:
    if word in reverse_lookup:
        ints.append(reverse_lookup[word])
    else:
        ints.append(400000)  # keep the "unknown" index, as in the original code
        no_matches.append(word)
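As a small aside (not from the original answers), the same lookup can also be written without try/except or an if-test by using dict.get with the unknown-word sentinel as the default:
for word in doc:
    # dict.get returns 400000 (the question's "unknown" sentinel) when the word is missing
    idx = word_to_index.get(word, 400000)
    if idx == 400000:
        no_matches.append(word)
    ints.append(idx)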

Apply a function to all inputs of a list of dictionaries

I am trying to run a function pre_process on a list input k1_tweets_filtered['text'].
However, the function only seems to work on one input at a time, i.e. k1_tweets_filtered[1]['text'].
I want the function to run on all inputs of k1_tweets_filtered['text'].
I have tried to use loops; however, the loop only outputs the words of the first input.
I am wondering whether this is the right approach for applying the function to the rest of the inputs.
This is the question I am trying to solve and what I have coded so far.
Write your code to pre-process and clean up all tweets stored in the variables k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered using the function pre_process() to result in new variables k1_tweets_processed, k2_tweets_processed and k3_tweets_processed.
for x in range(len(k1_tweets_filtered)):
    tweet_k1 = k1_tweets_filtered[x]['text']
    x += 1
    k1_tweets_processed = pre_process(tweet_k1)
The function pre_process is below; I know it is correct, as it was given to me.
def remove_non_ascii(s):
    return "".join(i for i in s if ord(i) < 128)

def pre_process(doc):
    """
    pre-processes a doc
    * Converts the tweet into lower case,
    * removes the URLs,
    * removes the punctuation,
    * tokenizes the tweet,
    * removes words with fewer than 3 characters
    """
    doc = doc.lower()
    # getting rid of non ascii codes
    doc = remove_non_ascii(doc)
    # replacing URLs
    url_pattern = r"http://[^\s]+|https://[^\s]+|www.[^\s]+|[^\s]+\.com|bit.ly/[^\s]+"
    doc = re.sub(url_pattern, 'url', doc)
    # removing dollars and usernames and other unnecessary stuff
    userdoll_pattern = r"\$[^\s]+|\#[^\s]+|\&[^\s]+|\*[^\s]+|[0-9][^\s]+|\~[^\s]+"
    doc = re.sub(userdoll_pattern, '', doc)
    # removing punctuation
    punctuation = r"\(|\)|#|\'|\"|-|:|\\|\/|!|\?|_|,|=|;|>|<|\.|\#"
    doc = re.sub(punctuation, ' ', doc)
    return [w for w in doc.split() if len(w) > 2]
k1_tweets_processed = []
for i in range(len(k1_tweets_filtered)):
    tweet_k1 = k1_tweets_filtered[i]['text']
    k1_tweets_processed.append(pre_process(tweet_k1))
When you iterate, it is conventional to use i or j as the loop variable name, and if you write "for i in range(10)" you should not increment i inside the loop. Also, your original code set k1_tweets_processed to a single pre-processed text on every iteration instead of creating a list and appending the new texts to it.
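A compact alternative is a list comprehension, repeated for the other two variables mentioned in the exercise (a sketch under the same assumption that each element of the filtered lists is a dict with a 'text' key):
k1_tweets_processed = [pre_process(tweet['text']) for tweet in k1_tweets_filtered]
k2_tweets_processed = [pre_process(tweet['text']) for tweet in k2_tweets_filtered]
k3_tweets_processed = [pre_process(tweet['text']) for tweet in k3_tweets_filtered]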

How to avoid for loops and iterate through pandas dataframe properly?

I have this code that I've been struggling to optimize for a while.
My dataframe is loaded from a CSV file with 2 columns; the second column contains the texts.
I have a function summarize(text, n) that takes a single text and an integer as input.
def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Checking if there are less sentences in the given review than the required length of the summary
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word tokenized sentences
    frequency = calculate_freq(list_sentences)  # calculating the word frequency for all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Calling the rank function to get the highest ranking
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
To summarize() all the texts, I first iterate through my dataframe and create a list of all the texts, which I then iterate over again to send them one by one to the summarize() function so I can get the summary of each text. These for loops are making my code really, really slow, but I haven't been able to figure out a way to make it more efficient, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:, 2]  # ilocating the texts
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts
our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)
ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in list(frequency.keys()):  # copy the keys so entries can be deleted while iterating
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency
def rank(ranking, n):
    # return n first sentences with highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)
#    ASIN                  Summary   Result
# 0     0   This is the first text  This ..
# 1     1              Second text  Secon..
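Wiring the same .map idea to the original summarize() might look like the sketch below (my assumption, not part of the answer: summarize(s, 1) returns a one-element list, so [0] extracts the sentence, and the text column is taken by position as in the question):
data = pd.read_csv('dataframe.csv')
# apply the question's summarize() row by row without building intermediate lists
data['our_summary'] = data.iloc[:, 2].map(lambda t: summarize(t, 1)[0])
ours = data[['our_summary']]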
Such a long story...
I'm going to assume that, since you are performing a text-frequency analysis, the order of reviewText doesn't matter. If that is the case:
Mega_String = ' '.join(data['reviewText'])
This concatenates all strings in the reviewText column into one big string, with each review separated by a white space.
You can just throw this result at your functions.
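For instance (a sketch, assuming word_tokenize and the question's calculate_freq are in scope), the concatenated string can be tokenized once and passed to the frequency function as a single "sentence":
mega_tokens = word_tokenize(Mega_String.lower())
frequency = calculate_freq([mega_tokens])  # calculate_freq expects a list of token lists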

Tips to transform a simple program in continuation programming?

For example, I would like to do some NLP text processing: extract some keywords, and find correlations between them (with prior lemma/POS segmentation).
The pipeline would be:
count all (lemmatised) words,
make a stopwords list,
use a RAKE-like algorithm to extract a keyword list,
build some frequency/correlation matrix from the keyword list content and/or the POS/lemma words...
For example, in pseudo-python:
def count_words(infile, open_and_read):
    dic = {}
    f = open_and_read(infile)
    for word in f:
        if word not in dic:
            dic[word] = 1
        else:
            dic[word] += 1
    return dic
etc etc
How do you transform this kind of pipeline into continuation programming?
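One minimal reading of the question, offered only as a sketch and not a definitive answer: rewrite each stage so that, instead of returning its result, it passes the result to a continuation (a callback for "what happens next"). All names below (count_words_cps, build_stopwords_cps, my_reader, the threshold) are illustrative assumptions.
def count_words_cps(infile, open_and_read, k):
    dic = {}
    for word in open_and_read(infile):
        dic[word] = dic.get(word, 0) + 1
    k(dic)  # hand the counts to the continuation instead of returning them

def build_stopwords_cps(counts, k):
    # toy rule: treat very frequent words as stop words (threshold is arbitrary)
    stopwords = {w for w, c in counts.items() if c > 100}
    k(counts, stopwords)

# Chaining the stages: each lambda is the continuation of the previous step.
count_words_cps('corpus.txt', my_reader,
    lambda counts: build_stopwords_cps(counts,
        lambda counts, stopwords: print(len(stopwords), "stop words")))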

How to predict the topic of a new query using a trained LDA model using gensim?

I have trained a corpus for LDA topic modelling using gensim.
Going through the tutorial on the gensim website (this is not the whole code):
question = 'Changelog generation from Github issues?'
temp = question.lower()
for i in range(len(punctuation_string)):
    temp = temp.replace(punctuation_string[i], '')
words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)
important_words = []
important_words = filter(lambda x: x not in stoplist, words)
print important_words

dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = []
ques_vec = dictionary.doc2bow(important_words)
print dictionary
print ques_vec
print lda[ques_vec]
This is the output that I get:
['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]
I don't know how the last output is going to help me find the possible topic for the question!
Please help!
I have written a function in python that gives the possible topic for a new query:
def getTopicForQuery(question):
    temp = question.lower()
    for i in range(len(punctuation_string)):
        temp = temp.replace(punctuation_string[i], '')
    words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)
    important_words = []
    important_words = filter(lambda x: x not in stoplist, words)
    dictionary = corpora.Dictionary.load('questions.dict')
    ques_vec = []
    ques_vec = dictionary.doc2bow(important_words)
    topic_vec = []
    topic_vec = lda[ques_vec]
    word_count_array = numpy.empty((len(topic_vec), 2), dtype = numpy.object)
    for i in range(len(topic_vec)):
        word_count_array[i, 0] = topic_vec[i][0]
        word_count_array[i, 1] = topic_vec[i][1]
    idx = numpy.argsort(word_count_array[:, 1])
    idx = idx[::-1]
    word_count_array = word_count_array[idx]
    final = []
    final = lda.print_topic(word_count_array[0, 0], 1)
    question_topic = final.split('*')  # as the format is like "probability * topic"
    return question_topic[1]
Before going through this, do refer to this link!
In the initial part of the code, the query is pre-processed so that it is stripped of stop words and unnecessary punctuation.
Then, the dictionary that was made from our own database is loaded.
We then convert the tokens of the new query to a bag of words, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.
The distribution is then sorted with respect to the probabilities of the topics. The topic with the highest probability is then displayed by question_topic[1].
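Condensed into a few lines, the same idea reads roughly as follows (a sketch, assuming lda, dictionary and ques_vec are built as in the question; taking the max over the (topic_id, probability) pairs replaces the numpy sorting above):
topic_probs = lda[ques_vec]                        # [(topic_id, probability), ...]
best_topic, best_prob = max(topic_probs, key=lambda pair: pair[1])
print(best_topic, best_prob)
print(lda.print_topic(best_topic, 10))             # top 10 words of the dominant topic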
Assuming we just need the topic with the highest probability, the following code snippet may be helpful:
def findTopic(testObj, dictionary):
    text_corpus = []
    '''
    For each query (document in the test file), tokenize the
    query, create a feature vector just like how it was done while training,
    and create text_corpus
    '''
    for query in testObj:
        temp_doc = tokenize(query.strip())
        current_doc = []
        for word in range(len(temp_doc)):
            if temp_doc[word][0] not in stoplist and temp_doc[word][1] == 'NN':
                current_doc.append(temp_doc[word][0])
        text_corpus.append(current_doc)
    '''
    For each feature vector text, lda[doc_bow] gives the topic
    distribution, which can be sorted in descending order to print the
    very first topic
    '''
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        print(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)
        print(topics)
        print(topics[0][0])
The tokenize function removes punctuation and domain-specific characters to be filtered out and returns the list of tokens. Here, the dictionary created during training is passed as a parameter of the function, but it can also be loaded from a file.
Basically, Anjmesh Pandey suggested good example code. However, the first word with the highest probability in a topic may not solely represent the topic, because in some cases clustered topics may share their most common words with other topics, even at the top. Therefore, it is enough to return the index of the topic most likely to be close to the query.
topic_id = sorted(lda[ques_vec], key=lambda pair: pair[1], reverse=True)[0][0]  # id of the highest-probability topic (the original tuple-unpacking lambda is Python 2 only)
The transformation of ques_vec gives you the per-topic weights, and you can then try to understand what the unlabeled topic is about by checking the words that contribute most to it.
latent_topic_words = map(lambda pair: pair[1], lda.show_topic(topic_id))  # assumes show_topic() yields (score, word) pairs, as in the gensim version this answer targets; newer versions yield (word, score)
The show_topic() method returns a list of tuples sorted by each word's contribution score to the topic, in descending order, and we can roughly understand the latent topic by checking those words and their weights.
