Clustering similar words and then mapping clusters into numbers in Python

I'm familiar with k-means for clustering data points, but not text. I have one column of words (some rows contain a single word, some contain several) in CSV format, and I want to cluster the rows that share one or more words, then map each cluster to a number and add those index numbers as a second column. I know Python has scipy packages and word2vec, but this is the first time I'm dealing with clustering text. Any idea how to do this? Code examples would be appreciated.
Edit: What I want is not words that are similar in meaning; I want matches on the exact text. For example, say three rows contain "Heart attack", "Heart failure" and "Heart broken". I want those rows in one cluster because they share the word "Heart". All the rows are connected with each other in some way, so what I really want is to cluster on the exact words.
from csv import DictReader

### convert my CSV file into a list
with open("export.csv") as f:
    my_list = [row["BASE_NAME"] for row in DictReader(f)]
# print(my_list)

## collect the unique words of every row in the CSV file
Set = []
for item in my_list:
    MySet = list(set(item.split(' ')))
    Set.append(MySet)
# print(Set)

## drop duplicate word lists
cleanlist = []
for x in Set:
    if x not in cleanlist:
        cleanlist.append(x)
print(cleanlist[1])
# print(cleanlist)

### my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
for i in range(len(cleanlist)):
    # cleanlist[i] is a list of words, so test each word against the rows
    matching = any(word in item for word in cleanlist[i] for item in my_list)
    print(matching)
Sample of my_list is ['Carbon Monoxide (Blood)', 'Carbon Monoxide Poisoning', 'Carbonic anhydrase inhibitor administered']
Sample of cleanlist is [['Antibody', 'Cardiolipin'], ['Cardiomegaly'], ['Cardiomyopathy'], ['Cardiopulmonary', 'Resuscitation', '(CPR)'], ['Diet', 'Cardiovascular'], ['Disease', 'Cardiovascular']]
SOLVED [Now I'm having a problem: my cleanlist does not contain only one item per index, which makes the matching comparison hard. How can I fix that?]
Also, I want to create a list for each comparison: for each comparison against cleanlist, I want to build one list that holds the words the rows have in common. Any help with that, please?

I'm a newbie at machine learning, but maybe I can give you some advice.
Let's suppose we have one row:
rowText = "mary have a little lamb"
Then split the words and put them in a set:
MySet = set(rowText.split(' '))
Now we can add more rows to this set; in the end we get one big set containing all the words:
MySet.update(newRowText.split(' '))
Now we should remove some unimportant words such as "a", "an", "is", "are", etc.
Convert the set back into an ordered list and check its length. Now we can create an N-dimensional space. For example, if the total list is ["Mary", "has", "lamb", "happy"] and the row is "Mary has a little lamb", we can convert the sentence to [1, 1, 1, 0].
Now you can cluster, right?
If you find that "Mary" is very important, you can adjust its weight, e.g. [2, 1, 1, 0].
For the processing approach, you can refer to Bayes, in my opinion.
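To make this concrete, here is a minimal sketch of the bag-of-words idea above, assuming a small hypothetical list of rows and using scikit-learn's KMeans for the clustering step (the answer does not name a specific clustering library, so that choice is an assumption):
from sklearn.cluster import KMeans

rows = ["Heart attack", "Heart failure", "Heart broken", "Carbon Monoxide Poisoning"]  # hypothetical sample rows

# Build the ordered vocabulary from every word of every row.
vocabulary = sorted({word for row in rows for word in row.split(' ')})

# One N-dimensional vector per row: 1 if the word occurs in the row, else 0.
vectors = [[1 if word in row.split(' ') else 0 for word in vocabulary] for row in rows]

# Cluster the vectors; the labels are the index numbers the asker wants as a second column.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for row, label in zip(rows, kmeans.labels_):
    print(row, label)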

Clustering text is hard, and most approaches do not work well. Clustering single words essentially requires a lot of background knowledge.
If you had longer text, you could measure similarity by the words they have in common.
But for single words, this approach does not work.
Consider:
Apple
Banana
Orange
Pear
Pea
To a human with a lot of background knowledge, apple and pear are arguably the two most similar. To a computer that has only these 3 to 6 bytes of string data, pear and pea are the most similar words.
You see: language is largely about background knowledge and associations. A computer that cannot associate both "apple" and "pear" with "fruit growing on a tree, usually green outside and white inside, black seeds in the center and a stem on top, about the size of a palm usually" cannot recognize what these have in common, and thus cannot cluster them.

For clustering you need some kind of distance measure. I suggest using the Hamming distance (see https://en.wikipedia.org/wiki/Hamming_distance). I think it is commonly used to measure the similarity between two words.
Edit:
for your examples this would mean
Heart attack
Heart failure => dist 7
Heart attack
Heart broken => dist 6
Heart failure
Heart broken => dist 7
Heart broken
Banana => dist 12
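Classic Hamming distance is only defined for strings of equal length, so the numbers above appear to count mismatched positions over the shared length plus the length difference. A minimal sketch under that assumption, reproducing the distances given:
def hamming_like(a, b):
    """Mismatched positions over the shared length, plus the length difference."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))

print(hamming_like("Heart attack", "Heart failure"))  # 7
print(hamming_like("Heart attack", "Heart broken"))   # 6
print(hamming_like("Heart failure", "Heart broken"))  # 7
print(hamming_like("Heart broken", "Banana"))         # 12
A pairwise matrix of such distances could then be handed to a clustering routine, e.g. scipy's hierarchical clustering.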

Related

How to cluster sentences by using n-gram overlap in Python?

I need to cluster sentences according to the common n-grams they contain. I can extract n-grams easily with nltk, but I have no idea how to perform clustering based on n-gram overlap, which is why I couldn't write real code for it; I am sorry for that. I wrote 6 simple sentences and the expected output to illustrate the problem.
import nltk
from nltk.util import ngrams

sentences = """I would like to eat pizza with her.
She would like to eat pizza with olive.
There are some sentences must be clustered.
These sentences must be clustered according to common trigrams.
The quick brown fox jumps over the lazy dog.
Apples are red, bananas are yellow."""

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokens = sent_detector.tokenize(sentences.strip())
mytrigrams = []
for sentence in sentence_tokens:
    trigrams = ngrams(sentence.lower().split(), 3)
    mytrigrams.append(list(trigrams))
After this I have no idea (I am not even sure whether this part is okay) how to cluster them according to common trigrams. I tried itertools.combinations but got lost, and I didn't know how to generate multiple clusters, since the number of clusters cannot be known without comparing each sentence with every other one. The expected output is given below; thanks in advance for any help.
Cluster1: 'I would like to eat pizza with her.'
'She would like to eat pizza with olive.'
Cluster2: 'There are some sentences must be clustered.'
'These sentences must be clustered according to common trigrams.'
Sentences do not belong to any cluster:
'The quick brown fox jumps over the lazy dog.'
'Apples are red, bananas are yellow.'
EDIT: I have tried combinations one more time, but it didn't cluster at all; it just returned all the sentence pairs (obviously I did something wrong).
from itertools import combinations

new_dict = {k: v for k, v in zip(sentence_tokens, mytrigrams)}
common = []
no_cluster = []
sentence_pairs = combinations(new_dict.keys(), 2)
for sentence1, sentence2 in sentence_pairs:
    # print(sentence1, sentence2)
    # NOTE: set(sentence) here is a set of characters, not of trigrams,
    # so nearly every pair shares something and ends up in `common`
    if len(set(sentence1) & set(sentence2)) != 0:
        common.append((sentence1, sentence2))
    else:
        no_cluster.append((sentence1, sentence2))
print(common)
But even if this code worked it would not give the output I expect, as I don't know how to generate multiple clusters based on common n-grams
To better understand your problem, could you explain the purpose and the expected outcome?
Using n-grams is something that must be done very carefully: when you use n-grams, you increase the number of dimensions of your dataset.
I advise you to first use TF-IDF, and only if you have not reached the minimum hit rate should you move on to n-grams.
If you can explain your problem better, I can see whether I can help you.
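As a concrete illustration of the TF-IDF suggestion, here is a minimal sketch using scikit-learn on the six example sentences above; the 0.3 similarity threshold is a hypothetical value that would need tuning, and sentences that end up alone correspond to the "do not belong to any cluster" group:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I would like to eat pizza with her.",
    "She would like to eat pizza with olive.",
    "There are some sentences must be clustered.",
    "These sentences must be clustered according to common trigrams.",
    "The quick brown fox jumps over the lazy dog.",
    "Apples are red, bananas are yellow.",
]

# Vectorize the sentences and compute pairwise cosine similarities.
tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)

# Greedily group sentences whose similarity exceeds the threshold.
threshold = 0.3
clusters, assigned = [], set()
for i in range(len(sentences)):
    if i in assigned:
        continue
    group = [i] + [j for j in range(i + 1, len(sentences))
                   if j not in assigned and similarity[i, j] >= threshold]
    assigned.update(group)
    clusters.append(group)

for label, group in enumerate(clusters):
    for idx in group:
        print(label, sentences[idx])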

SequenceMatcher - finding the two most similar elements of two or more lists of data

I was trying to compare a set of strings to an already defined set of strings.
For example, say you want to find the addressee of a letter whose text has been digitized via OCR.
There is an array of addresses whose elements are dictionaries.
Each element, which is unique, contains ID, Name, Street, ZIP code and City. This list will be 1000 entries long.
Since OCR-scanned text can be inaccurate, we need to find the best-matching candidate strings from the list that contains the addresses.
The text is 750 words long. We reduce the number of words with an appropriate filter function, which first splits on whitespace, strips extra whitespace from each element, deletes all words less than 5 characters long and removes duplicates; the resulting list is 200 words long.
Since each addressee has 4 strings (name, street, ZIP code and city) and the remaining letter is 200 words long, my comparison has to run 4 * 1000 * 200 = 800,000 times.
I have used Python with moderate success. Matches have correctly been found. However, the algorithm takes a long time to process many letters (up to 50 hrs per 1500 letters). List comprehension has been applied. Is there a way to correctly (and not unnecessarily) implement multithreading? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks, but I do not know how much time it would take to process a lot of documents on a small AWS instance.
>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:5] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressee, letter):
       """compare the similarity of each word in the letter with the addressee"""
       ratios = []
       for l in letter:
           for a in addressee.items():
               ratios.append(int(100 * SequenceMatcher(None, a, l).ratio()))  # using ints for faster arithmetic
       return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:5])  # percentage of the most matching word in the letter with anything from the addressee
82
>> # then use this method to find all addressees with the max matching ratio
>> # if only one is greater than the others -> Done
>> # if more than one, but fewer than 3 are equal -> interactive prompt -> Done
>> # else -> mark as not sortable -> Done.
I expected faster processing for each document (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long it takes with quick_ratio() or real_quick_ratio() instead of ratio().
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information:
for a in addressee.items():
    s = SequenceMatcher()
    s.set_seq2(a)
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))
But a better solution would be something like #J_H describes
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version with limited edit distance appears in the dictionary. Sort the input word and probe the map for 1st entry greater or equal to that. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
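A rough sketch of this sorted-key idea, assuming single-token dictionary words and using bisect to probe the sorted key list; the helper names are mine, not from the answer:
from bisect import bisect_left
from difflib import SequenceMatcher

def canonical(word):
    """Sorted characters form the lookup key, as in the anagram trick."""
    return ''.join(sorted(word.lower()))

def build_index(dictionary_words):
    index = {}
    for word in dictionary_words:
        index.setdefault(canonical(word), []).append(word)
    return index, sorted(index)

def best_match(query, index, sorted_keys):
    key = canonical(query)
    pos = bisect_left(sorted_keys, key)
    # Candidate keys: the preceding entry plus the 1st and 2nd entries >= key.
    candidates = []
    for k in sorted_keys[max(pos - 1, 0):pos + 2]:
        candidates.extend(index[k])
    if not candidates:
        return None
    return max(candidates, key=lambda w: SequenceMatcher(None, w, query).ratio())

index, keys = build_index(["Stockholm", "Boulevard", "Taxation", "Insurance"])
print(best_match("Stockhlm", index, keys))  # a hypothetical OCR error -> "Stockholm"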
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.
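With fuzzywuzzy, for example, the per-word lookup could be as simple as the following sketch (the candidate list is hypothetical, and scores/thresholds would need tuning):
from fuzzywuzzy import process

candidates = ["Stockholm", "Boulevard", "Taxation", "Insurance"]
# Returns the best-scoring candidate and its score, e.g. ('Stockholm', <score>).
print(process.extractOne("St0ckhlm", candidates))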

How to generate the result of bigrams with highest probabilities with a list of individual alphabetical strings as input

I am learning natural language processing, currently the bigram topic. At this stage, I am having difficulty with the Python computation, but I am trying.
I will be using this corpus, which has not been subjected to tokenization, as my main raw dataset. I can generate the bigram results using the nltk module. However, my question is how to compute, in Python, bigrams that contain specific words. More specifically, I wish to find all the bigrams available in corpus_A that contain words from word_of_interest.
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
I want to get the bigrams for each of the individual words in word_of_interest. Next, I want to get the frequency of each bigram based on its appearance in corpus_A. With the frequencies available, I want to sort and print the bigrams by their probability, from highest to lowest.
I have tried code from an online search, but it does not give me an output. The code is below:
from nltk.collocations import BigramCollocationFinder

for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i  # note: this line is not valid Python syntax
    x += 1
print(bigram_j)
Unfortunately, the output does not return what I am planning to achieve.
Please advise me. The output I want is the bigrams containing the specific words from word_of_interest, with their probabilities sorted from highest to lowest, as shown below.
[((santa, clauss), 0.89), ((he, and), 0.67), ((stands, firm), 0.34))]
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(2, 2), use_idf=False)
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']

matrix = vec.fit_transform(corpus).toarray()
vocabulary = vec.get_feature_names()

all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:
            index = vocabulary.index(bigram)
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)

df = pd.DataFrame({'bigram': all_bigrams, 'frequency': all_frequencies})
df.sort_values('frequency', inplace=True)
df.head()
The output is a pandas dataframe showing the bigrams sorted by frequency.
bigram frequency
0 (for, santa) 0.109764
19 (stands, firm) 0.109764
18 (he, stands) 0.109764
17 (their, hand) 0.109764
16 (hand, to) 0.109764
The rationale here is that TfidfVectorizer counts how many times a token is present in each document of the corpus, computes the term frequency, and stores this information in the column associated with that token. The index of that column is the same as the index of the associated word in the vocabulary retrieved with the vectorizer's .get_feature_names() method once it has been fit.
Then you just have to select the rows of the matrix containing the relative frequencies of the tokens and sum along the column of interest.
The doubly nested for loop is not ideal, though, and there may be a more efficient implementation for it. The issue is that get_feature_names returns not tuples but a list of strings of the form ['first_token second_token', ...].
I would be interested in seeing a better implementation of the second half of the above code.
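One possible way to drop the doubly nested loop, reusing matrix, vocabulary, word_of_interest and pd from the code above, and assuming only whole-word matches are wanted (the original substring test would also match e.g. 'stands' for 'and'):
interest = set(word_of_interest)
rows = []
for index, bigram in enumerate(vocabulary):
    first, second = bigram.split(' ')
    if first in interest or second in interest:
        rows.append({'bigram': (first, second), 'frequency': matrix[:, index].sum()})

df = pd.DataFrame(rows).sort_values('frequency', ascending=False)
print(df.head())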

How to build a bag of phrases (and remaining words) in a pandas dataset

Here's my dataset
I love you baby
I love stackoverflow
I have stackoverflow account
What I want
I Love 2
stackoverflow 2
you 1
baby 1
I 1 # the other two already on "I love"
...
What I want is: if any phrase of more than one word occurs more than once in the dataframe, it should go into my bag of phrases.
I'm quite sure pandas doesn't have a ready-to-go tool for this case.
You need to implement an algorithm. In this case I can think of something like this:
1. Split all the text into one array.
2. At the end of each line add a unique word (like end_of_line_01, end_of_line_02, etc.), so afterwards you have an array like this:
   I, love, you, baby, end_of_line_01, I, love, stackoverflow, end_of_line_02, I, have, stackoverflow, account, end_of_line_03
3. Take the first 2 words and search the array for the same words, in the same order, anywhere else.
   a. If yes, keep the count of how many times, and try the same with one more word.
   b. If not, count only the first word.
4. At the end of this step, remove the words that have been taken and add them to the result.
5. Repeat step 3.
6. Remove all the unique words you added from the final result.
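A simplified sketch of this idea (not the full sentinel-array walk described above: it only looks for repeated two-word phrases, then counts whatever words are left over):
from collections import Counter

lines = ["I love you baby", "I love stackoverflow", "I have stackoverflow account"]
tokens_per_line = [line.split() for line in lines]

# Count every two-word phrase that occurs within a line.
bigram_counts = Counter(
    pair
    for tokens in tokens_per_line
    for pair in zip(tokens, tokens[1:])
)
repeated_phrases = {pair: count for pair, count in bigram_counts.items() if count > 1}

# Count the words not covered by a repeated phrase.
leftover = Counter()
for tokens in tokens_per_line:
    covered = set()
    for i, pair in enumerate(zip(tokens, tokens[1:])):
        if pair in repeated_phrases:
            covered.update({i, i + 1})
    leftover.update(t for i, t in enumerate(tokens) if i not in covered)

for (first, second), count in repeated_phrases.items():
    print(first, second, count)   # e.g. "I love 2"
for word, count in leftover.most_common():
    print(word, count)            # e.g. "stackoverflow 2", "you 1", "baby 1", ...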

How to group by phrases in multi-dimensional python list

I'm trying to match exact phrases from a list against tweets. If a tweet contains any of the items in the phrase list, I want to add that to my result, which tells me whether the tweet is of interest to me.
Then I want to sum up the found phrases by grouping them across the tweets, and use Twitter's favourite and retweet counts to build a result set that tells me whether my phrases are popular.
So far I have got to the point where I can match tweet text against my phrase list. But the problem I see is that this is becoming O(n^2) in complexity, and I'm not a Python purist. Can someone give me an idea of how to approach the solution?
I know the code is incomplete, but I'm not finding a way to approach this problem.
attributes = ["denim", "solid", "v-neck"]
tweets = ["#hashtag1 new maternity autumn denim A -line dresses solid knee-length v-neck",
"RT #amyohconnor: Thank you, Daily Mail, for addressing working women's chief concern: how to dress for dosy male colleagues who don't appre…",
"Why are so many stores selling little girl dresses for women? Its so fucking creepy",
"top Online Shopping Indian Dresses for Women Salwar Kameez Suits Design Indian Dresses in amazon"
]
i = 0
trending_attributes = []
for tweet in tweets:
    tweet = tweet.lower()
    found_attributes = []
    for attribute in attributes:
        if attribute.lower() in tweet:
            found_attributes.append(attribute)
    trending_attributes.append([i, found_attributes])
    i += 1
print(trending_attributes)
Output as of now,
[[0, ['denim', 'knee-length', 'solid', 'v-neck']], [1, []], [2, ['girl', 'ny']], [3, ['pin', 'shopping']]]
What I'm looking for:
attrib1: RetweetCount, FavouriteCount
attrib2: RetweetCount, FavouriteCount
attrib3: RetweetCount, FavouriteCount
If your attributes don’t have spaces inside of them (or don’t overlap), you can get this down to O(n) complexity (average case, per tweet) using set.
tweet = "we begin our story in new york"
words = set(tweet.split())
for attribute in attributes:
if attribute in words:
found_attributes.append(attribute)
If your phrases are multiple words long, and you want to be clever, consider creating a dictionary with the words as keys and sets of the positions as values
positions = {"we": {0}, "our": {2}, "begin": {1} ... }
Then loop through the words in each attribute and see if they appear in sequential order. This should be O(1) for each lookup, making it O(n) in the total number of words in your attributes. Creating the positions dictionary is O(m) (where m is the number of words in the tweet), since you iterate through enumerate(tweet.split()) to build it. Keep in mind you still have to do this for every tweet in your list.
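A sketch of that positions-dictionary idea for multi-word phrases (the helper name is mine, not from the answer):
def contains_phrase(tweet, phrase):
    """True if the words of `phrase` occur consecutively in `tweet`."""
    words = tweet.lower().split()
    positions = {}
    for i, word in enumerate(words):
        positions.setdefault(word, set()).add(i)
    phrase_words = phrase.lower().split()
    # Every position of the first word is a candidate start; check the rest follow in order.
    return any(
        all(start + offset in positions.get(word, set())
            for offset, word in enumerate(phrase_words))
        for start in positions.get(phrase_words[0], set())
    )

print(contains_phrase("new maternity autumn denim a-line dresses", "autumn denim"))  # True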
Tweets are more or less of constant size, though, so asymptotic analysis is not very useful here. If you are looking for relationships between two arrays of data, it will usually be at a minimum O(p*q), where p and q are the dimensions of the data. I don't think you can do much better than quadratic time if both your attributes and your list of tweets are of variable size. However, I suspect your attributes list is of "constant" size, so this may effectively be linear in time.
