How to group by phrases in multi-dimensional python list - python

I'm trying to match exact phrases in a list to tweets. If a tweet contains any of the phrases in the list, I want to add that to my result, which tells me whether the tweet is of interest to me.
Then I want to sum up the found phrases by grouping them across the tweets and use Twitter's favourite and retweet counts, so that my result set tells me whether my phrases are popular.
So far I have got to a point where I can match tweet text to my phrases list. But the problem I see is that this is becoming O(n²) in complexity, and I'm not a Python purist. Can someone give me an idea how to approach the solution?
I know the code is incomplete, but I'm not finding a way to approach this problem.
attributes = ["denim", "solid", "v-neck"]
tweets = ["#hashtag1 new maternity autumn denim A -line dresses solid knee-length v-neck",
          "RT #amyohconnor: Thank you, Daily Mail, for addressing working women's chief concern: how to dress for dosy male colleagues who don't appre…",
          "Why are so many stores selling little girl dresses for women? Its so fucking creepy",
          "top Online Shopping Indian Dresses for Women Salwar Kameez Suits Design Indian Dresses in amazon"
          ]

i = 0
trending_attributes = []
for tweet in tweets:
    tweet = tweet.lower()
    found_attributes = []
    for attribute in attributes:
        if attribute.lower() in tweet:
            found_attributes.append(attribute)
    trending_attributes.append([i, found_attributes])
    i += 1

print(trending_attributes)
Output as of now,
[[0, ['denim', 'knee-length', 'solid', 'v-neck']], [1, []], [2, ['girl', 'ny']], [3, ['pin', 'shopping']]]
What I'm looking for:
attrib1: RetweetCount, FavouriteCount
attrib2: RetweetCount, FavouriteCount
attrib3: RetweetCount, FavouriteCount
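For concreteness, here is a minimal sketch of the grouping step described above, assuming each tweet arrives as a dict with retweet_count and favourite_count fields (hypothetical fields; the sample data in this question only contains the tweet text) and reusing the attributes list from the code above:
from collections import defaultdict

# hypothetical input: tweets with their engagement counts attached
tweets_with_counts = [
    {"text": "new maternity autumn denim dresses solid knee-length v-neck", "retweet_count": 3, "favourite_count": 7},
    {"text": "solid colour shirt", "retweet_count": 1, "favourite_count": 2},
]

totals = defaultdict(lambda: [0, 0])   # attribute -> [retweet total, favourite total]
for tweet in tweets_with_counts:
    text = tweet["text"].lower()
    for attribute in attributes:
        if attribute.lower() in text:
            totals[attribute][0] += tweet["retweet_count"]
            totals[attribute][1] += tweet["favourite_count"]

for attribute, (retweets, favourites) in totals.items():
    print(f"{attribute}: {retweets}, {favourites}")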

If your attributes don't have spaces inside of them (or don't overlap), you can get this down to O(n) complexity (average case, per tweet) using a set.
tweet = "we begin our story in new york"
words = set(tweet.split())
for attribute in attributes:
if attribute in words:
found_attributes.append(attribute)
If your phrases are multiple words long, and you want to be clever, consider creating a dictionary with the words as keys and sets of the positions as values
positions = {"we": {0}, "our": {2}, "begin": {1} ... }
Then loop through the words in each attribute and check whether they appear in sequential order. Each lookup is O(1), making this O(n) in the total number of words across your attributes. Creating the positions dictionary is O(m) (where m is the number of words in the tweet), since you iterate through enumerate(tweet.split()) to build it. Keep in mind you still have to do this for every tweet in your list.
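A minimal sketch of that positions-dictionary idea (my own illustration of the approach, not code from the question), checking whether the words of a multi-word phrase appear in consecutive positions:
def contains_phrase(tweet, phrase_words):
    # map each word of the tweet to the set of positions where it occurs
    positions = {}
    for pos, word in enumerate(tweet.lower().split()):
        positions.setdefault(word, set()).add(pos)
    # keep only the start positions from which every phrase word follows in order
    starts = positions.get(phrase_words[0], set())
    for offset, word in enumerate(phrase_words[1:], start=1):
        starts = {p for p in starts if p + offset in positions.get(word, set())}
    return bool(starts)

print(contains_phrase("we begin our story in new york", ["new", "york"]))   # True
print(contains_phrase("we begin our story in new york", ["york", "new"]))   # False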
Tweets are more or less of constant size, though, so asymptotic analysis is not very useful here. If you are looking for relationships between two arrays of data, it will usually be at a minimum O(p*q), where p and q are the dimensions of the data. I don't think you can do much better than quadratic time if both your attributes and your list of tweets are of variable size. However, I suspect your attributes list is of "constant" size, so this may effectively be linear in time.

Related

How do you summarize a list of 100 terms into a small number (~5) of significant terms?

I have a list that has 100 terms that looks like this:
['Paycheck', 'Unemployment claim', 'credit card', 'minimum wage job', 'Instagram', 'xxx I hate you', 'u r sexy',....'hi Mr Robinson', 'savings', 'severance pay', 'deductible', 'rent forgiveness']
I want to ideally summarize this into a short list of 5~6 relevant terms, e.g.:
['Pay','Financial issues','unemployment','social benefits','money problems']
If that's too much, at least rank the terms by relevance and then list the top 5 terms.
Problem is, most text analysis tools use term frequency and/or work across multiple documents, but in my case my list has only unique terms and also contains noise (unrelated terms).
Is there any Python tool or API that can pull this off?
To clarify what I mean about relevance, consider the following simple example:
['Dollars','Cash','International Currency','Credit card','Comics','loans','David Beckham']
Most of the terms are related to money or finance, so 'Comics' and 'David Beckham' are irrelevant and need to be discarded or placed at the lowest rank or something.
Then either provide two 'summary' terms like 'money' and 'finance' or just rank words by how close they are to the common meaning.
You can try word embeddings such as word2vec or BERT: they give you a vector for each term, and you can compare these vectors with cosine similarity and put the most similar terms into one group.
For more information you can look at these examples:
https://chainer-colab-notebook.readthedocs.io/en/latest/notebook/official_example/word2vec.html
https://colab.research.google.com/drive/1ZQvuAVwA3IjybezQOXnrXMGAnMyZRuPU
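A rough sketch of that idea, assuming gensim's downloader with the pretrained glove-wiki-gigaword-50 vectors and a KMeans clustering step; the model name, the averaging of word vectors for multi-word terms, and the number of clusters are illustrative choices, not requirements:
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

# small pretrained embedding; any other vector model would work the same way
model = api.load("glove-wiki-gigaword-50")

terms = ["Dollars", "Cash", "International Currency", "Credit card", "Comics", "David Beckham"]

def embed(term):
    # average the word vectors of a (possibly multi-word) term; unknown words are skipped
    vecs = [model[w] for w in term.lower().split() if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.array([embed(t) for t in terms])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
for term, label in zip(terms, labels):
    print(label, term)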

How to generate the result of bigrams with highest probabilities with a list of individual alphabetical strings as input

I am learning natural language processing, currently the bigram topic. At this stage, I am having difficulty with the Python computation, but I am trying.
I will be using this corpus, which has not been subjected to tokenization, as my main raw dataset. I can generate the bigram results using the nltk module. However, my question is how to generate, in Python, the bigrams that contain specific words. More specifically, I wish to find all the bigrams available in corpus_A that contain words from word_of_interest.
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
I want to get the bigrams for each of the individual words from the list word_of_interest. Next, I want to get the frequency of each of those bigrams based on their appearance in corpus_A. With the frequencies available, I want to sort and print out the bigrams by their probability, from highest to lowest.
I have tried out code found through online searches, but it does not give me an output. The code is shown below:
for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i
    x += 1
print(bigram_j)
Unfortunately, the output did not return what I am planning to achieve.
Please advise me. The output that I want should have the bigrams containing the specific words from word_of_interest, with their probabilities sorted from highest to lowest, as shown below:
[((santa, clauss), 0.89), ((he, and), 0.67), ((stands, firm), 0.34)]
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(2,2), use_idf=False)
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']

matrix = vec.fit_transform(corpus).toarray()
vocabulary = vec.get_feature_names()

all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:
            index = vocabulary.index(bigram)
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)

df = pd.DataFrame({'bigram': all_bigrams, 'frequency': all_frequencies})
df.sort_values('frequency', inplace=True)
df.head()
The output is a pandas dataframe showing the bigrams sorted by frequency.
bigram frequency
0 (for, santa) 0.109764
19 (stands, firm) 0.109764
18 (he, stands) 0.109764
17 (their, hand) 0.109764
16 (hand, to) 0.109764
The rationale here is that TfidfVectorizer counts how many times a token is present in each document of the corpus, computes the term-specific frequency, and stores this information in the column associated with that token. The index of that column is the same as the index of the associated word in the vocabulary, which is retrieved with the .get_feature_names() method on the already-fitted vectorizer.
Then you just have to select all rows of the matrix containing the relative frequencies of the tokens and sum along the column of interest.
The double-nested for loop is not ideal though, and there may be a more efficient implementation for it. The issue is that get_feature_names returns not tuples but a list of strings of the form ['first_token second_token', ...].
I would be interested in seeing a better implementation of the second half of the above code.
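One possible rewrite of that second half (a sketch only, reusing the matrix, vocabulary, word_of_interest and pandas objects from the snippet above): enumerate the vocabulary once instead of calling vocabulary.index() inside a nested loop. Note that it matches whole tokens rather than substrings, so 'and' no longer matches 'hand to', which may or may not be what you want.
interest = set(word_of_interest)
rows = []
for index, bigram in enumerate(vocabulary):
    tokens = bigram.split(' ')
    # keep the bigram if any of its tokens is a word of interest (whole-token match)
    if interest.intersection(tokens):
        rows.append((tuple(tokens), matrix[:, index].sum()))

df = pd.DataFrame(rows, columns=['bigram', 'frequency'])
df = df.sort_values('frequency', ascending=False)
print(df.head())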

clustering similar words and then mapping clusters into numbers in python

I'm familiar with k-means for clustering data points, but not text. I have one column of words (some rows have only one word, some have more, etc.) in CSV format. I want to cluster the rows that share a similar word (or more), and then map those clusters to numbers as an index; those index numbers need to be added as a second column. I know there are scipy packages and word2vec in Python, but this is the first time I am dealing with clustering text. Any idea how to do this? Any code examples will be appreciated.
Edit: What I want is not words that are similar in meaning; I want rows that share the exact same text. For example, if we have three rows containing "Heart attack", "Heart failure" and "Heart broken", I want those rows to be in one cluster because they have the common word "Heart". By the way, all the rows are connected with each other somehow, so what I really want is to cluster on the exact words.
from csv import DictReader
import sets

### converting my csv file into a list
with open("export.csv") as f:
    my_list = [row["BASE_NAME"] for row in DictReader(f)]
#print(my_list)

## having every word in the csv file
Set = list()
for item in my_list:
    MySet = list(set(item.split(' ')))
    Set.append(MySet)
#print(Set)

cleanlist = []
[cleanlist.append(x) for x in Set if x not in cleanlist]
print(cleanlist[1])
#print(cleanlist)

###my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
#for item in my_list:
for i in xrange(len(cleanlist)):
    # matching = [s for s in my_list if cleanlist[i] in s]
    # matching = [x for x in my_list if cleanlist[i] in x]
    matching = any(cleanlist[[i]] in item for item in my_list)
    print(matching)
Sample of my_list is ['Carbon Monoxide (Blood)', 'Carbon Monoxide Poisoning', 'Carbonic anhydrase inhibitor administered']
Sample of cleanlist is [['Antibody', 'Cardiolipin'], ['Cardiomegaly'], ['Cardiomyopathy'], ['Cardiopulmonary', 'Resuscitation', '(CPR)'], ['Diet', 'Cardiovascular'], ['Disease', 'Cardiovascular']]
SOLVED [I was having problems: my cleanlist did not contain only one item for each index, which made the matching comparison hard. How to fix that?]
Also, I want to create a list for each comparison, so for each comparison against the clean list I want to create one list that holds the similar words between them. Any help with that, please?
I'm a newbie in machine learning, but maybe I can give you some advice.
Let's suppose we have one row:
rowText = "mary have a little lamb"
then split the word and put it in a set:
MySet = set(rowText.split(' '))
Now we can add more rows to this set; finally we get one big set including all the words.
MySet.update(newRowText.split(' '))
Now we should remove some not-so-important words such as "a", "an", "is", "are", etc.
Convert the set back to an ordered list and check the length of the list. Now we can create an N-dimensional space. For example, if the total list is ["Mary", "has", "lamb", "happy"] and the row is "Mary has a little lamb", we can convert the sentence to [1, 1, 1, 0].
Now you can do the clustering, right?
If you find that "Mary" is very important, you can adjust the weight of "Mary", like [2, 1, 1, 0].
For the processing approach, you can refer to Bayes, in my opinion.
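A compact sketch of that recipe (my own illustration; the use of scikit-learn and the number of clusters are assumptions, not part of the answer), using CountVectorizer for the binary word vectors and KMeans for the clustering:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

rows = ["Heart attack", "Heart failure", "Heart broken", "Carbon Monoxide Poisoning"]

# binary bag-of-words: one dimension per word, 1 if the word occurs in the row
vec = CountVectorizer(binary=True)
X = vec.fit_transform(rows)

# cluster the rows and use the cluster label as the index for the second column
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
for row, label in zip(rows, labels):
    print(label, row)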
Clustering text is hard, and most approaches do not work well. Clustering single words essentially requires a lot of background knowledge.
If you had longer text, you could measure similarity by the words they have in common.
But for single words, this approach does not work.
Consider:
Apple
Banana
Orange
Pear
Pea
To a human who knows a lot, apple and pear are arguably the two most similar. To a computer that has only these 3 to 6 bytes of string data, pear and pea are the most similar words.
You see: language is a lot about background knowledge and associations. A computer who cannot associate both "apple" and "pear" with "fruit growing on a tree, usually green outside and white inside, black seeds in the center and a stem on top, about the size of a palm usually" cannot recognize what these have in common, and thus cannot cluster them.
For clustering you need some kind of distance measure. I suggest using the Hamming distance (see https://en.wikipedia.org/wiki/Hamming_distance) for strings of equal length, or more generally an edit distance such as the Levenshtein distance; these are commonly used to measure the similarity of two words.
Edit:
For your examples this would mean:
Heart attack / Heart failure => dist 7
Heart attack / Heart broken => dist 6
Heart failure / Heart broken => dist 7
Heart broken / Banana => dist 12
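For illustration, a small self-contained sketch of a Levenshtein-style edit distance (a standard dynamic-programming formulation, not taken from this answer) whose pairwise distances could be fed into a clustering step:
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

words = ["Heart attack", "Heart failure", "Heart broken", "Banana"]
for x in words:
    for y in words:
        if x < y:
            print(x, "/", y, "=>", edit_distance(x, y))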

Use Gensim for scoring features in each document. Also a Python memory issue

I am using gensim on a corpus of 50000 documents along with a dictionary of around 4000 features. I also have an LSI model already prepared for the same.
Now I want to find the highest-matching features for each of the added documents. To find the best features in a particular document, I am running gensim's similarity module for each of the features on all the documents. This gives us a score for each of the features that we want to use later on. But as you can imagine, this is a costly process: we have to iterate over 50000 indices and run 4000 iterations of similarity on each.
I need a better way of doing this, as I run out of the 8 GB of memory on my system at around 1000 iterations. There's actually no reason for the memory to keep rising, as I am only reallocating it during the iterations. Surprisingly, the memory starts rising only after around 200 iterations.
Why the memory issue? How can it be solved?
Is there a better way of finding the highest scored features in a particular document (not topics)?
Here's a snippet of the code that runs out of memory:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('features-dict.dict')
corpus = corpora.MmCorpus('corpus.mm')
lsi = models.LsiModel.load('model.lsi')
corpus_lsi = lsi[corpus]
index = similarities.MatrixSimilarity(list(corpus_lsi))

newDict = dict()
for feature in dictionary.token2id.keys():
    vec_bow = dictionary.doc2bow([feature])
    vec_lsi = lsi[vec_bow]
    sims = index[vec_lsi]
    li = sorted(enumerate(sims * 100), key=lambda item: -item[1])
    for data in li:
        newDict[data[0]] = (feature, data[1])  # Store feature and score for each document
# Do something with the newDict created above
EDIT:
The memory issue was resolved using a memory profiler. There was something else in that loop that caused it to rise drastically.
Let me explain the purpose in detail. Imagine we are dealing with various recipes (each recipe is a document) and each item in our dictionary is an ingredient. Find six such recipes below.
corpus = [["Olive Oil", "Tomato", "Brocolli", "Oregano"], ["Garlic", "Olive Oil", "Bread", "Cheese", "Oregano"], ["Avocado", "Beans", "Cheese", "Lime"], ["Jalepeneo", "Lime", "Tomato", "Tortilla", "Sour Cream"], ["Chili Sauce", "Vinegar", "Mushrooms", "Rice"], ["Soy Sauce", "Noodles", "Brocolli", "Ginger", "Vinegar"]]
There are thousands of such recipes. What I am trying to achieve is to assign a weight between 0 and 100 to each ingredient (where a higher-weighted ingredient is more important or more unique). What would be the best way to achieve this?
Let's break this down:
Unless I misunderstood your purpose, you can simply use the left singular vectors from lsi.projection.u to get your weights:
# create #features x #corpus 2D matrix of weights
doc_feature_matrix = numpy.dot(lsi.projection.u, index.index.T)
Rows of this matrix should be the "document weights" you're looking for, one row per feature.
the call to list() in your list(lsi[corpus]) makes your code very inefficient. It basically serializes the entire doc-topic matrix into RAM. Drop the list() and use the streamed version directly, it's much more memory-efficient: index = MatrixSimilarity(lsi[corpus], num_features=lsi.num_topics).
LSI usually works better over regularized input. Consider transforming the plain bag-of-words vectors (=integers) via e.g. TF-IDF or log entropy transformation before passing it to LSI.
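As a sketch of that last point (reusing the dictionary and corpus objects loaded in the snippet above; the number of topics here is arbitrary), the bag-of-words corpus can be wrapped in a TF-IDF transformation before training LSI:
from gensim import models

tfidf = models.TfidfModel(corpus)        # fit IDF weights on the bag-of-words corpus
corpus_tfidf = tfidf[corpus]             # streamed TF-IDF view of the corpus

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)
corpus_lsi = lsi[corpus_tfidf]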

tag generation from a text content

I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.
Additionally, I would be grateful if you could point me to any Python-based solution / library for this.
Thanks
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document, as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
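A toy sketch of this scoring (my own illustration; the probability estimates are simple count ratios over a small hypothetical tokenized collection, which is one of several reasonable choices):
import math
from collections import Counter

docs = [["markov", "chain", "model"],
        ["the", "markov", "property"],
        ["the", "cat", "sat", "on", "the", "mat"]]

term_counts = Counter(t for doc in docs for t in doc)   # term counts over the whole collection
total_terms = sum(term_counts.values())

def pmi(term, doc):
    p_term_doc = doc.count(term) / total_terms          # estimate of P(term, doc)
    p_term = term_counts[term] / total_terms             # estimate of P(term)
    p_doc = len(doc) / total_terms                       # estimate of P(doc)
    return math.log(p_term_doc / (p_term * p_doc)) if p_term_doc else float("-inf")

doc = docs[0]
keywords = sorted(set(doc), key=lambda t: pmi(t, doc), reverse=True)[:5]
print(keywords)   # terms of the first document ranked by PMI with that document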
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to
extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
First, the key python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in Python code, for the problem raised in your question; although it's a template, it runs--supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among important words in a document--ie, those words that are likely to distinguish it from another document. If you had no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, or importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the word list by frequency
# (just a DSU sort, there's a python built-in for this, but i can't remember it)
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]
# eg, what are the most common words in that text?
print(word_freq_sorted)
# returns: [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)]
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list:
map(term_importance, word_freq.keys())
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I had used it once to dissect a corpus of product reviews into the latent ideas that were being spoken about across all the documents, such as 'customer service', 'product usability', etc. The basic model does not advocate a way to convert the topic models into a single word describing what a topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs.
LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great, but on the flip side it might not deliver the topics you were expecting it to give.
A very simple solution to the problem would be:
count the occurrences of each word in the text
consider the most frequent terms as the key phrases
have a black-list of 'stop words' to remove common words like the, and, it, is, etc.
I'm sure there are cleverer, stats-based solutions though.
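A minimal sketch of that approach, using only the standard library (the text and the stop-word black-list here are tiny hand-made examples):
from collections import Counter

text = "the cat sat on the mat and the cat slept"
stop_words = {"the", "and", "it", "is", "on", "a", "an"}   # tiny hand-made black-list

# count every word that is not on the stop-word black-list
counts = Counter(word for word in text.lower().split() if word not in stop_words)
print(counts.most_common(3))   # e.g. [('cat', 2), ('sat', 1), ('mat', 1)]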
If you need a solution to use in a larger project rather than just for interest's sake, Yahoo BOSS has a key term extraction method.
Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.
A basic example would be if we were to run LDA over a corpus and define it to have two topics, and that we find further that a text in the corpus is 70% one topic, and 30% another. The top 70% of the words that define the first topic and 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.
A general reference for the preprocessing needed for this code can be found here; with that done, we can find tags through the following process using gensim.
A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5
Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sublist entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words where texts are converted to the indexes of their component tokens (words):
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
Create an LDA or HDP model:
dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')
# dirichlet_model = HdpModel(corpus=bow_corpus,
#                            id2word=dirichlet_dict,
#                            chunksize=len(bow_corpus))
The following code produces ordered lists for the most important words per topic (note that here is where num_tags defines the desired tags per text):
shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                           num_words=num_tags,
                                           formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
Then find the coherence of the topics across the texts:
topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0
topics_per_text = [text for text in topic_corpus]
From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:
corpus_tags = []
for i in range(len(bow_corpus)):
    # The complexity here is to make sure that it works with HDP
    significant_topics = list(set([t[0] for t in topics_per_text[i]]))
    topic_indexes_by_coherence = [tup[0] for tup in sorted(enumerate(topics_per_text[i]), key=lambda x: x[1])]
    significant_topics_by_coherence = [significant_topics[i] for i in topic_indexes_by_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_coherence][:num_topics]  # subset for HDP
    ordered_topic_coherences = [topics_per_text[i] for i in topic_indexes_by_coherence][:num_topics]  # subset for HDP

    text_tags = []
    for i in range(num_topics):
        # Find the number of indexes to select, which can later be extended if the word has already been selected
        selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[i]))))
        if selection_indexes == [] and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            selection_indexes = [0]

        for s_i in selection_indexes:
            # ignore_words is a list of words that should not be included
            if ordered_topics[i][s_i] not in text_tags and ordered_topics[i][s_i] not in ignore_words:
                text_tags.append(ordered_topics[i][s_i])
            else:
                selection_indexes.append(selection_indexes[-1] + 1)

    # Fix for if too many were selected
    text_tags = text_tags[:num_tags]
    corpus_tags.append(text_tags)
corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
See this answer for a similar version of this that generates tags for a whole text corpus.
