What does the Brown clustering algorithm output mean? - python

I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a python implementation, https://github.com/mheilman/tan-clustering. Both give some sort of binary string and an integer for each unique token. For example:
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
What do the bit string and the integer mean?
From the first link, the binary string is known as a bit-string; see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/
But how do I tell from the output that dog, mouse, and cat form one cluster, while the and chased are not in the same cluster?

If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. In the case of those bit strings, you just take the first L characters.
For example, cutting at the second character gives you two clusters
10 chased
11 dog
11 mouse
11 cat
At the third character you get
110 dog
111 mouse
111 cat
The cutting strategy is a different subject though.
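As a small illustration of that truncation (using the bit strings from the question; clusters_at and prefix_len are just names I made up for this sketch):

from collections import defaultdict

bitstrings = {'the': '0', 'chased': '10', 'dog': '110', 'mouse': '1110', 'cat': '1111'}

def clusters_at(bitstrings, prefix_len):
    """Group words whose bit strings share the same first prefix_len bits."""
    clusters = defaultdict(list)
    for word, bits in bitstrings.items():
        clusters[bits[:prefix_len]].append(word)
    return dict(clusters)

print(clusters_at(bitstrings, 2))  # {'0': ['the'], '10': ['chased'], '11': ['dog', 'mouse', 'cat']}
print(clusters_at(bitstrings, 3))  # splits dog off from mouse and cat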

In Percy Liang's implementation (https://github.com/percyliang/brown-cluster), the -C parameter allows you to specify the number of word clusters. The output contains all the words in the corpus, together with a bit-string annotating the cluster and the word frequency in the following format: <bit string> <word> <word frequency>. The number of distinct bit strings in the output equals the number of desired clusters and the words with the same bit string belong to the same cluster.
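For example, a rough sketch of reading that output and grouping words by their bit string (the file path below is just a placeholder for wherever wcluster wrote its output):

from collections import defaultdict

clusters = defaultdict(list)
with open('output/paths') as f:          # placeholder path to the wcluster output file
    for line in f:
        bits, word, freq = line.split()  # <bit string> <word> <word frequency>
        clusters[bits].append(word)

for bits, words in clusters.items():     # one entry per cluster
    print(bits, words)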

Change your command to: ./wcluster --text input.txt --c 3
--c number
This number is the number of clusters; the default is 50. You can't tell the different word clusters apart because the default input has only three sentences. Change the 50 clusters to 3 and you can see the difference.
I entered three tweets as the input and gave 3 as the cluster parameter.

The integers are counts of how many times the word is seen in the document. (I have tested this in the python implementation.)
From the comments at the top of the python implementation:
Instead of using a window (e.g., as in Brown et al., sec. 4), this
code computed PMI using the probability that two randomly selected
clusters from the same document will be c1 and c2. Also, since the
total numbers of cluster tokens and pairs are constant across pairs,
this code use counts instead of probabilities.
From the code in the python implementation we see that it outputs the word, the bit string and the word counts.
def save_clusters(self, output_path):
    with open(output_path, 'w') as f:
        for w in self.words:
            f.write("{}\t{}\t{}\n".format(w, self.get_bitstring(w),
                                          self.word_counts[w]))

My guess is:
According to Figure 2 in Brown et al. (1992), the clustering is hierarchical, and to get from the root to each word "leaf" you make a sequence of up/down decisions. If up is 0 and down is 1, you can represent each word as a bit string.
From https://github.com/mheilman/tan-clustering/blob/master/class_lm_cluster.py :
# the 0/1 bit to add when walking up the hierarchy
# from a word to the top-level cluster
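If that reading is right, the length of the shared bit-string prefix of two words tells you how deep in the hierarchy they merge; a tiny sketch using the bit strings from the question:

import os

def shared_prefix_len(bits_a, bits_b):
    """Depth of the lowest common ancestor of two leaves in the hierarchy."""
    return len(os.path.commonprefix([bits_a, bits_b]))

print(shared_prefix_len('1110', '1111'))  # mouse vs cat: 3, they merge deep in the tree
print(shared_prefix_len('0', '1110'))     # the vs mouse: 0, they only meet at the root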

Related

SequenceMatcher - finding the two most similar elements of two or more lists of data

I was trying to compare a set of strings to an already defined set of strings.
For example, you want to find the addressee of a letter whose text was digitized via OCR.
There is an array of addresses, whose elements are dictionaries.
Each element, which is unique, contains ID, Name, Street, ZIP Code and City. This list will be 1000 entries long.
Since OCR-scanned text can be inaccurate, we need to find the best-matching candidates among the strings in the list that contains the addresses.
The text is 750 words long. We reduce the number of words with an appropriate filter function, which first splits on whitespace, strips extra whitespace from each element, deletes all words shorter than 5 characters and removes duplicates; the resulting list is 200 words long.
Since each addressee has 4 strings (Name, Street, ZIP code and City) and the remaining letter is 200 words long, my comparison has to run 4 * 1000 * 200
= 800,000 times.
I have used Python with moderate success. Matches have correctly been found. However, the algorithm takes a long time to process many letters (up to 50 hrs per 1500 letters). List comprehensions have been applied. Is there a way to implement multithreading correctly (and only where necessary)? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks, but I do not know how much time it will take to process many documents on a small AWS instance.
>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:5] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressee, letter):
       """compare the similarity of each word in the letter with the addressee"""
       ratios = []
       for l in letter:
           for a in addressee.items():
               ratios.append(int(100 * SequenceMatcher(None, a, l).ratio()))  # using ints for faster arithmetic
       return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:5]) # percentage of the most matching word in the letter with anything from the addressee
82
>> # then use this method to find all addressees with the max matching ratio
>> # if only one is greater than the others -> Done
>> # if more than one, but fewer than 3 are equal -> Interactive Prompt -> Done
>> # else -> mark as not sortable -> Done.
I expected processing each document to be faster (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.
Is there a better (faster) way to do this?
A few quick tips:
1) Let me know how long it takes with quick_ratio() or real_quick_ratio() instead of ratio().
2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information:
for a in addressee.items():
    s = SequenceMatcher()
    s.set_seq2(a)
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))
But a better solution would be something like what J_H describes.
You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.
Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.
Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.
Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:
'dgo' -> ['dog', 'god']
Store this map sorted by key.
Given an input word, you want to know if exactly that word appears in the dictionary, or if a version with limited edit distance appears in the dictionary. Sort the input word and probe the map for 1st entry greater or equal to that. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
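A rough sketch of that scheme (the names and the SequenceMatcher scoring at the end are my own choices, not part of the original suggestion):

from bisect import bisect_left
from difflib import SequenceMatcher

# Pre-process the dictionary: sorted-letter key -> list of words with that key.
dictionary_words = ["stockholm", "street", "boulevard", "dog", "god"]
key_to_words = {}
for w in dictionary_words:
    key_to_words.setdefault(''.join(sorted(w)), []).append(w)
sorted_keys = sorted(key_to_words)

def best_match(word):
    """Probe the sorted keys around the input's canonical key, then score the few candidates."""
    key = ''.join(sorted(word.lower()))
    i = bisect_left(sorted_keys, key)
    candidates = []
    for j in (i - 1, i, i + 1):  # preceding entry plus the 1st and 2nd entries >= target
        if 0 <= j < len(sorted_keys):
            candidates.extend(key_to_words[sorted_keys[j]])
    if not candidates:
        return None
    return max(candidates, key=lambda c: SequenceMatcher(None, c, word.lower()).ratio())

print(best_match("St0ckholm"))  # likely 'stockholm'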
If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.

How to generate the result of bigrams with highest probabilities with a list of individual alphabetical strings as input

I am learning natural language processing, on the topic of bigrams. At this stage, I am having difficulty with the Python computation, but I am trying.
I will be using this corpus, which has not been subjected to tokenization, as my main raw dataset. I can generate the bigram results using the nltk module. However, my question is how to compute, in Python, the bigrams that contain specific words. More specifically, I wish to find all the bigrams available in corpus_A that contain words from word_of_interest.
corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
I want to get the bigrams for each of the individual words in word_of_interest. Next, I want to get the frequency of each bigram based on its appearance in corpus_A. With the frequencies available, I want to sort and print the bigrams by their probability from highest to lowest.
I have tried code from an online search, but it does not give me an output. The code is shown below:
for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i
    x += 1
print(bigram_j)
Unfortunately, the output did not return what I am planning to achieve.
Please advise me. The output that I want will have the bigrams with the specific words from word_of_interest and their probabilities, sorted as shown below:
[(('santa', 'clauss'), 0.89), (('he', 'and'), 0.67), (('stands', 'firm'), 0.34)]
You can try this code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(2, 2), use_idf=False)

corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']

matrix = vec.fit_transform(corpus).toarray()
vocabulary = vec.get_feature_names()

all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:
            index = vocabulary.index(bigram)
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)

df = pd.DataFrame({'bigram': all_bigrams, 'frequency': all_frequencies})
df.sort_values('frequency', inplace=True)
df.head()
The output is a pandas dataframe showing the bigrams sorted by frequency.
bigram frequency
0 (for, santa) 0.109764
19 (stands, firm) 0.109764
18 (he, stands) 0.109764
17 (their, hand) 0.109764
16 (hand, to) 0.109764
The rationale here is that TfidfVectorizer counts how many times a token is present in each document of the corpus, computes the term frequency, and stores this information in a column associated with that token. The index of that column is the same as the index of the associated term in the vocabulary retrieved with the .get_feature_names() method of the already-fitted vectorizer.
Then you just have to select the rows of the matrix containing the relative frequencies of the tokens and sum along the column of interest.
The doubly nested for loop is not ideal, though, and there may be a more efficient implementation. The issue is that get_feature_names returns not tuples but a list of strings of the form ['first_token second_token', ...].
I would be interested in seeing a better implementation of the second half of the above code.
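For instance, a possible tightening (my own sketch, continuing from the variables defined above): scan the vocabulary once and look words up in a set instead of calling vocabulary.index() inside a loop. Note that this matches whole tokens, whereas the substring test `word in bigram` above also matches e.g. 'and' inside 'hand'.

# Assumes vocabulary, matrix, word_of_interest and pd from the snippet above.
words = set(word_of_interest)
rows = []
for index, bigram in enumerate(vocabulary):
    first, second = bigram.split(' ')
    if first in words or second in words:
        rows.append((tuple(bigram.split(' ')), matrix[:, index].sum()))

df = pd.DataFrame(rows, columns=['bigram', 'frequency'])
df = df.sort_values('frequency', ascending=False)  # highest probability first
print(df.head())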

clustering similar words and then mapping clusters into numbers in python

I'm familiar with k-means for clustering data points, but not text. I have one column of words (some rows have only one word, some have more, etc.) in CSV format, and I want to cluster the rows that share a word or more, and then map those clusters to numbers used as an index; those index numbers need to be added as a second column. I know there are scipy packages and word2vec in Python, but this is the first time I am dealing with clustering text. Any idea how to do this? Any code examples will be appreciated.
Edit: What I want is not words that are similar in meaning; I want words that are literally the same. For example, say we have three words in three different rows: Heart attack, Heart failure, Heart broken. I want those rows to be in one cluster because they share the word "Heart". And by the way, all the rows are connected with each other somehow, so what I really want is to cluster on the exact words.
from csv import DictReader
import sets

### converting my csv file into a list!!
with open("export.csv") as f:
    my_list = [row["BASE_NAME"] for row in DictReader(f)]
#print(my_list)

## collecting every word in the csv file
Set = list()
for item in my_list:
    MySet = list(set(item.split(' ')))
    Set.append(MySet)
#print(Set)

cleanlist = []
[cleanlist.append(x) for x in Set if x not in cleanlist]
print(cleanlist[1])
#print(cleanlist)

###my_list = ['abc-123', 'def-456', 'ghi-789', 'abc-456']
#for item in my_list:
for i in xrange(len(cleanlist)):
    # matching = [s for s in my_list if cleanlist[i] in s]
    # matching = [x for x in my_list if cleanlist[i] in x]
    matching = any(cleanlist[[i]] in item for item in my_list)
    print(matching)
Sample of my_list is ['Carbon Monoxide (Blood)', 'Carbon Monoxide Poisoning', 'Carbonic anhydrase inhibitor administered']
Sample of cleanlist is [['Antibody', 'Cardiolipin'], ['Cardiomegaly'], ['Cardiomyopathy'], ['Cardiopulmonary', 'Resuscitation', '(CPR)'], ['Diet', 'Cardiovascular'], ['Disease', 'Cardiovascular']]
SOLVED [I was having a problem: my cleanlist does not contain only one item per index, which makes the matching comparison hard. How do I fix that?]
Also, I want to create a list for each comparison: for each comparison against the clean list, I want to create one list that holds the words the two entries have in common. Any help with that, please?
I'm a newbie in machine learning, but maybe I can give you some advice.
Let's suppose we have one row:
rowText = "mary have a little lamb"
Then split the words and put them in a set:
MySet = set(rowText.split(' '))
Now we can add more rows to this set; in the end we get one big set including all the words.
MySet.update(newRowText.split(' '))
Now we should remove some not-so-important words such as "a", "an", "is", "are", etc.
Convert the set back to a list with a fixed order and check the length of the list. Now we can create an N-dimensional space. For example, if the total list is ["Mary", "has", "lamb", "happy"] and the row is "Mary has a little lamb", we can convert the sentence to [1, 1, 1, 0].
Now you can do the clustering, right?
If you find that "Mary" is very important, you can increase the weight of "Mary", e.g. [2, 1, 1, 0].
For the processing approach, in my opinion you can borrow ideas from Bayes-style methods.
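As a rough sketch of that idea with scikit-learn (the row texts and cluster count are made up for illustration): turn each row into a bag-of-words vector, cluster the vectors, and use the cluster label as the numeric index for the second column.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

rows = ["Heart attack", "Heart failure", "Heart broken",
        "Carbon Monoxide Poisoning", "Carbon Monoxide (Blood)"]

# One column per word, one row per input string (the N-dimensional space above).
X = CountVectorizer().fit_transform(rows).toarray()

# Rows sharing words end up close together and tend to get the same label.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for row, label in zip(rows, labels):
    print(label, row)  # the label can serve as the numeric cluster index column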
Clustering text is hard, and most approaches do not work well. Clustering single words essentially requires a lot of background knowledge.
If you had longer text, you could measure similarity by the words they have in common.
But for single words, this approach does not work.
Consider:
Apple
Banana
Orange
Pear
Pea
To a human who knows a lot, apple and pear are supposedly the two most similar. To a computer that has only these 3 to 6 bytes of string data, pear and pea are the most similar words.
You see: language is largely about background knowledge and associations. A computer that cannot associate both "apple" and "pear" with "fruit growing on a tree, usually green outside and white inside, black seeds in the center and a stem on top, about the size of a palm usually" cannot recognize what these have in common, and thus cannot cluster them.
For clustering you need some kind of distance measurement. I suggest using the Hamming Distance (see https://en.wikipedia.org/wiki/Hamming_distance). I think it's common to use it to measure similarity between two words.
Edit:
For your examples this would mean:
Heart attack vs. Heart failure => dist 7
Heart attack vs. Heart broken => dist 6
Heart failure vs. Heart broken => dist 7
Heart broken vs. Banana => dist 12
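A small sketch of that measure (Hamming distance is only defined for strings of equal length, so here the shorter string is padded, which is the assumption that reproduces the distances above):

from itertools import zip_longest

def hamming(a, b):
    """Count positions that differ; padding means extra length counts as mismatches."""
    return sum(x != y for x, y in zip_longest(a, b))

print(hamming("Heart attack", "Heart failure"))  # 7
print(hamming("Heart attack", "Heart broken"))   # 6
print(hamming("Heart failure", "Heart broken"))  # 7
print(hamming("Heart broken", "Banana"))         # 12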

Matching 2 short descriptions and returning a confidence level

I have some data that I get from the Banks using Yodlee and the corresponding transaction messages on the mobile. Both have some description in them - short descriptions.
For example -
string1 = "tatasky_TPSL MUMBA IND"
string2 = "tatasky_TPSL"
They can be matched if one is completely contained in the other. However, some strings like
string1 = "T.G.I Friday's"
string2 = "TGI Friday's MUMBA MAH"
still need to be matched. Is there any algorithm which gives a confidence level for matching 2 descriptions?
You might want to use normalized edit distance, also called Levenshtein distance (see the Levenshtein distance article on Wikipedia). After getting the Levenshtein distance between two strings, you can normalize it by dividing by the length of the longest string (or the average of the two lengths). This normalized score can act as the confidence. You can find 4-5 Python packages for calculating Levenshtein distance, and you can also try it with an online edit distance calculator.
Alternatively, a simple solution is the longest common subsequence algorithm, which can also be used here.
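A minimal sketch of the normalized-distance idea without any extra packages (a plain dynamic-programming Levenshtein, normalized by the longer string's length so that 1.0 means identical):

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def confidence(s1, s2):
    """Normalized similarity: 1.0 for identical strings, lower as they diverge."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1.lower(), s2.lower()) / max(len(s1), len(s2))

print(confidence("tatasky_TPSL MUMBA IND", "tatasky_TPSL"))
print(confidence("T.G.I Friday's", "TGI Friday's MUMBA MAH"))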

string comparison in python but not Levenshtein distance (I think)

I found a crude string comparison in a paper I am reading, done as follows:
The equation they use is as follows (extracted from the paper with small word changes to make it more general and readable).
I have tried to explain it a bit more in my own words, since the description by the author is not very clear (using an example by the author).
For example, for the 2 sequences ABCDE and BCEFA, there are two possible graphs:
graph 1) connects B with B, C with C and E with E
graph 2) connects A with A
I cannot connect A with A while connecting the other three (graph 1), since that would mean crossing lines (imagine drawing lines between B-B, C-C and E-E); the line linking A-A would cross the lines linking B-B, C-C and E-E.
So these two sequences result in 2 possible graphs; one has 3 connections (BB, CC and EE) and the other only one (AA). Then I calculate the score d as given by the equation below.
Consequently, to define the degree of similarity between two
penta-strings we calculate the distance d between them. Aligning the
two penta-strings, we look for all the identities between their
characters, wherever these may be located. If each identity is
represented by a link between both penta-strings, we define a graph
for this pair. We call any part of this graph a configuration.
Next, we retain all of those configurations in which there is no character
cross pairing (the meaning is explained in my example above, i.e., no crossings of links between identical characters and only those graphs are retained).
Each of these is then evaluated as a function of the
number p of characters related to the graph, the shifting Δi for the
corresponding pairs and the gap δij between connected characters of
each penta-string. The minimum value is chosen as characteristic and
is called distance d: d = Min(50 − 10p + ΣΔi + Σδij). Although very rough,
this measure is generally in good agreement with the qualitative,
eye-guided estimation. For instance, the distance between abcde and abcfg
is 20, whereas that between abcde and abfcg is 23 = (50 − 30 + 1 + 2).
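To check my reading of the formula, here is a tiny helper (my own sketch, not from the paper) that only evaluates d for a configuration specified by hand; it reproduces the quoted 50 − 30 + 1 + 2 = 23 example, but it does not search for the non-crossing configurations themselves, which is the part I am stuck on:

def configuration_score(p, shifts, gaps):
    """d for one configuration: 50 - 10*p + sum of shifts + sum of gaps."""
    return 50 - 10 * p + sum(shifts) + sum(gaps)

# abcde vs abfcg: p = 3 paired characters, shifts summing to 1, gaps summing to 2.
print(configuration_score(3, [1], [2]))  # 23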
I am confused as to how to go about doing this. Any suggestions to help me would be much appreciated.
I tried the Levenshtein distance and also simple sequence alignment as used in protein sequence comparison.
The link to the paper is:
http://peds.oxfordjournals.org/content/16/2/103.long
I could not find any information on the first author, Alain Figureau, and my emails to M.A. Soto have not been answered (as of today).
Thank you
Well, it's definitely not Levenshtein:
>>> from nltk import metrics
>>> metrics.distance.edit_distance('abcde','abcfg')
2
>>> metrics.distance.edit_distance('abcde','abfcg')
3
>>> help(metrics.distance.edit_distance)
Help on function edit_distance in module nltk.metrics.distance:
edit_distance(s1, s2)
Calculate the Levenshtein edit-distance between two strings.
The edit distance is the number of characters that need to be
substituted, inserted, or deleted, to transform s1 into s2. For
example, transforming "rain" to "shine" requires three steps,
consisting of two substitutions and one insertion:
"rain" -> "sain" -> "shin" -> "shine". These operations could have
been done in other orders, but at least three steps are needed.
@param s1, s2: The strings to be analysed
@type s1: C{string}
@type s2: C{string}
@rtype C{int}
Just after the text block you cite, there is a reference to a previous paper by the same authors: Secondary Structure of Proteins and Three-dimensional Pattern Recognition. I think it is worth looking into if there is no explanation of the distance (I'm not at work, so I don't have access to the full document).
Otherwise, you can also try to contact the authors directly: Alain Figureau seems to be an old-school French researcher with no contact information whatsoever (no webpage, no e-mail, no "social networking", ...), so I advise trying to contact M.A. Soto, whose e-mail is given at the end of the paper. I think they will give you the answer you're looking for: the experiment's procedure has to be crystal clear in order to be repeatable; it's part of the scientific method in research.
