I have several human genes of interest (with their gene names, gene IDs, sequences, and accession number information) that are homologous to their corresponding mouse genes in terms of sequence similarity. I am curious to know in which region of the human genes the sequence similarity exists. For example, I want to know whether the similarity lies in (a) promoters (proximal/intermediate/distal promoters), (b) exons (first exon/other exons), (c) introns (first intron/other introns), (d) untranslated regions (5' UTR/3' UTR), or (e) intergenic regions (downstream/distal intergenic).
Is there an efficient way to find in which region of the genes the homology exists? Are there any R/Python packages available for the task? If not, would anyone like to collaborate on making one?
Assume that we have 1000 words (A1, A2, ..., A1000) in a dictionary. As far as I understand, word embedding or word2vec methods aim to represent each word in the dictionary by a vector where each element represents the similarity of that word to the remaining words in the dictionary. Is it correct to say there should be 999 dimensions in each vector, i.e., that the size of each word2vec vector should be 999?
But with Gensim in Python, we can modify the value of the "size" parameter of Word2Vec, let's say size = 100 in this case. So what does "size=100" mean? If we extract the output vector of A1, denoted (x1, x2, ..., x100), what do x1, x2, ..., x100 represent in this case?
It is not the case that "[word2vec] aims to represent each word in the dictionary by a vector where each element represents the similarity of that word with the remaining words in the dictionary".
Rather, given a certain target dimensionality, say 100, the Word2Vec algorithm gradually trains word-vectors of 100 dimensions to be better and better at its training task, which is predicting nearby words.
This iterative process tends to force words that are related to be "near" each other, in rough proportion to their similarity - and even further the various "directions" in this 100-dimensional space often tend to match with human-perceivable semantic categories. So, the famous "wv(king) - wv(man) + wv(woman) ~= wv(queen)" example often works because "maleness/femaleness" and "royalty" are vaguely consistent regions/directions in the space.
The individual dimensions, alone, don't mean anything. The training process includes randomness, and over time just does "whatever works". The meaningful directions are not perfectly aligned with dimension axes, but angled through all the dimensions. (That is, you're not going to find that v[77] is a gender-like dimension. Rather, if you took dozens of alternate male-like and female-like word pairs and averaged all their differences, you might find a direction through the 100-dimensional space that is suggestive of the gender distinction.)
You can pick any 'size' you want, but 100-400 are common values when you have enough training data.
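For concreteness, here is a minimal Gensim sketch (note: in gensim 4.x the parameter is named vector_size rather than size; the toy corpus below is far too small for the analogy to actually work and is only meant to show the shapes involved):
from gensim.models import Word2Vec

# Tiny illustrative corpus - real training needs far more text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# vector_size=100 means every word gets a 100-dimensional vector,
# regardless of how many words are in the vocabulary.
model = Word2Vec(sentences, vector_size=100, min_count=1, epochs=50)

vec = model.wv["king"]
print(vec.shape)   # (100,) - the entries are not similarities to the other words
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))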
I am trying to do some string matching using the Levenshtein algorithm for closest words on businesses. (In Python, but the language won't make a huge difference.)
An example query would be
search = 'bna'
lat & lon are close to the result I am looking for.
There is a pub right by that latitude and longitude called BNA Brewing Co. By searching BNA, my hope would be that it shows up first (as bna == bna).
I have tried two different ways:
m = min([editdistance.eval(search, place_split)
         for place_split in place.name.split(' ')
         if place_split not in string.punctuation])
This returns, without ranking based on geographical distance (only Levenshtein distance):
Coffee & Books In Town Center
Talk 'n' Coffee
Raggedy Ann & Andy's
and, taking geographical distance into account (secondary to Levenshtein):
Shapers Hair Salon & Spa
Amora Day Spa
Pure Esthetics and Micro-Pigmentation
And
m = editdistance.eval(search, place.name)
This one returns, without ranking based on geographical distance (only Levenshtein distance):
KFC
MOO
A&W
and, taking geographical distance into account (secondary to Levenshtein):
A&W
A&W
KFC
So you can see that neither way returns anything close to BNA Brewing Co.
What kind of logic do I have to use to get it to return something when the search term exactly matches one of the place names in my database?
Recall that Levenshtein distances count the number of substitutions, additions and deletions required to transform one string into another. Because of this, they often are minimized when comparing strings of similar length (because even if a lot of substitutions are required, you don't have to add or remove a bunch of characters). You can see this playing out in your second example where your best outputs all are the same length as your search string (len("bna") == len("A&W")).
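To see the length effect concretely (lower-cased here so case does not get in the way; numbers assume the editdistance package):
import editdistance

print(editdistance.eval('bna', 'a&w'))              # 3  - three substitutions
print(editdistance.eval('bna', 'bna brewing co.'))  # 12 - 'bna' is a prefix, so 12 insertions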
If your search string is always going to be a single word, then your idea to calculate the distance for each word in the string is a good one, since each word is more likely to be of similar length to your search string. However, you are currently doing a case-sensitive comparison, which means that editdistance.eval('bna', 'BNA') == 3, which I'm guessing you don't want.
Try:
m = min([editdistance.eval(search.lower(), place_split.lower())
         for place_split in place.name.split(' ')
         if place_split not in string.punctuation])
which should give you a case-insensitive search.
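If you also want geographic distance to act as a tie-breaker, a rough sketch could look like this (the places list, the coordinates, and the squared-degree distance are made-up illustrations, not your actual data model):
import string
import editdistance

def best_word_distance(search, name):
    # smallest case-insensitive edit distance between the search term and any word of the name
    return min(editdistance.eval(search.lower(), word.lower())
               for word in name.split(' ')
               if word not in string.punctuation)

def rank(places, search, lat, lon):
    # primary key: word-level edit distance; secondary key: rough geographic closeness
    return sorted(places, key=lambda p: (best_word_distance(search, p['name']),
                                         (p['lat'] - lat) ** 2 + (p['lon'] - lon) ** 2))

places = [{'name': 'BNA Brewing Co.', 'lat': 49.888, 'lon': -119.496},
          {'name': 'A&W', 'lat': 49.885, 'lon': -119.493}]
print(rank(places, 'bna', 49.888, -119.496)[0]['name'])  # -> BNA Brewing Co.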
I have some data that I get from banks using Yodlee, and the corresponding transaction messages on mobile. Both contain short descriptions.
For example -
string1 = "tatasky_TPSL MUMBA IND"
string2 = "tatasky_TPSL"
They can be matched if one is completely inside the other. However, some strings like
string1 = "T.G.I Friday's"
string1 = "TGI Friday's MUMBA MAH"
still need to be matched. Is there any algorithm which gives a confidence level when matching two descriptions?
You might want to use normalized edit distance, i.e., normalized Levenshtein distance (see the Levenshtein distance Wikipedia article). After getting the Levenshtein distance between two strings, you can normalize it by dividing by the length of the longest string (or the average length of the two strings). This normalized score can act as a confidence. You can find some 4-5 Python packages for calculating Levenshtein distance, and you can also try it online with an edit distance calculator.
Alternatively, one simple solution is the longest common subsequence algorithm, which can also be used here.
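A minimal sketch of that normalization idea (using the editdistance package and dividing by the longer length, as suggested above):
import editdistance

def match_confidence(a, b):
    # 1.0 means identical (ignoring case); values near 0.0 mean very little in common
    dist = editdistance.eval(a.lower(), b.lower())
    return 1.0 - dist / max(len(a), len(b), 1)

print(match_confidence("tatasky_TPSL MUMBA IND", "tatasky_TPSL"))
print(match_confidence("T.G.I Friday's", "TGI Friday's MUMBA MAH"))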
I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. They both give some sort of binary string and another integer for each unique token. For example:
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
What do the binary string and the integer mean?
From the first link, the binary is known as a bit-string, see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/
But how do I tell from the output that dog, mouse and cat form one cluster, while the and chased are not in the same cluster?
If I understand correctly, the algorithm gives you a tree and you need to truncate it at some level to get clusters. In the case of those bit strings, you should just take the first L characters.
For example, cutting at the second character gives you two clusters
10 chased
11 dog
11 mouse
11 cat
At the third character you get
110 dog
111 mouse
111 cat
The cutting strategy is a different subject though.
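For example, a small sketch of that truncation idea in Python (the bit strings below are the ones from the question):
from collections import defaultdict

bitstrings = {'the': '0', 'chased': '10', 'dog': '110', 'mouse': '1110', 'cat': '1111'}

def clusters_at(bitstrings, L):
    # group together all words whose bit strings share the same first L characters
    groups = defaultdict(list)
    for word, bits in bitstrings.items():
        groups[bits[:L]].append(word)
    return dict(groups)

print(clusters_at(bitstrings, 2))
# {'0': ['the'], '10': ['chased'], '11': ['dog', 'mouse', 'cat']}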
In Percy Liang's implementation (https://github.com/percyliang/brown-cluster), the --c parameter allows you to specify the number of word clusters. The output contains all the words in the corpus, together with a bit string annotating the cluster and the word frequency, in the following format: <bit string> <word> <word frequency>. The number of distinct bit strings in the output equals the number of desired clusters, and words with the same bit string belong to the same cluster.
Change your command to: ./wcluster --text input.txt --c 3
--c number
This number specifies the number of clusters; the default is 50. You can't distinguish the different clusters of words because the default input has only three sentences. Change the 50 clusters to 3 clusters and you can tell the difference.
I entered three tweets as the input and gave 3 as the cluster parameter.
The integers are counts of how many times the word is seen in the document. (I have tested this in the python implementation.)
From the comments at the top of the python implementation:
Instead of using a window (e.g., as in Brown et al., sec. 4), this
code computed PMI using the probability that two randomly selected
clusters from the same document will be c1 and c2. Also, since the
total numbers of cluster tokens and pairs are constant across pairs,
this code use counts instead of probabilities.
From the code in the python implementation we see that it outputs the word, the bit string and the word counts.
def save_clusters(self, output_path):
    with open(output_path, 'w') as f:
        for w in self.words:
            f.write("{}\t{}\t{}\n".format(w, self.get_bitstring(w),
                                          self.word_counts[w]))
My guess is:
According to Figure 2 in Brown et al. (1992), the clustering is hierarchical, and to get from the root to each word "leaf" you have to make an up/down decision. If up is 0 and down is 1, you can represent each word as a bit string.
From https://github.com/mheilman/tan-clustering/blob/master/class_lm_cluster.py :
# the 0/1 bit to add when walking up the hierarchy
# from a word to the top-level cluster
I found a crude string comparison in a paper I am reading done as follows:
The equation they use is as follows (extracted from the paper with small word changes to make it more general and readable)
I have tried to explain a bit more in my own words since the description by the author is not very clear (using an example by the author)
For example, for the two sequences ABCDE and BCEFA, there are two possible graphs:
graph 1) connects B with B, C with C and E with E
graph 2) connects A with A
I cannot connect A with A when I am connecting the other three (graph 1), since that would mean crossing lines (imagine drawing lines between B-B, C-C and E-E); that is, the line linking A-A would cross the lines linking B-B, C-C and E-E.
So these two sequences result in two possible graphs; one has three connections (B-B, C-C and E-E) and the other only one (A-A). Then I calculate the score d as given by the equation below.
Consequently, to define the degree of similarity between two
penta-strings we calculate the distance d between them. Aligning the
two penta-strings, we look for all the identities between their
characters, wherever these may be located. If each identity is
represented by a link between both penta-strings, we define a graph
for this pair. We call any part of this graph a configuration.
Next, we retain all of those configurations in which there is no character
cross pairing (the meaning is explained in my example above, i.e., no crossings of links between identical characters and only those graphs are retained).
Each of these is then evaluated as a function of the
number p of characters related to the graph, the shifting Δi for the
corresponding pairs and the gap δij between connected characters of
each penta-string. The minimum value is chosen as characteristic and
is called distance d: d = min(50 - 10p + ΣΔi + Σδij). Although very rough,
this measure is generally in good agreement with the qualitative eye
guided estimation. For instance, the distance between abcde and abcfg
is 20, whereas that between abcde and abfcg is 23 = (50 - 30 + 1 + 2).
I am confused as to how to go about doing this. Any suggestions to help me would be much appreciated.
I tried the Levenshtein distance and also a simple sequence alignment as used in protein sequence comparison.
The link to the paper is:
http://peds.oxfordjournals.org/content/16/2/103.long
I could not find any information on the first author, Alain Figureau, and my emails to M.A. Soto have not been answered (as of today).
Thank you
Well, it's definitely not Levenshtein:
>>> from nltk import metrics
>>> metrics.distance.edit_distance('abcde','abcfg')
2
>>> metrics.distance.edit_distance('abcde','abfcg')
3
>>> help(metrics.distance.edit_distance)
Help on function edit_distance in module nltk.metrics.distance:
edit_distance(s1, s2)
Calculate the Levenshtein edit-distance between two strings.
The edit distance is the number of characters that need to be
substituted, inserted, or deleted, to transform s1 into s2. For
example, transforming "rain" to "shine" requires three steps,
consisting of two substitutions and one insertion:
"rain" -> "sain" -> "shin" -> "shine". These operations could have
been done in other orders, but at least three steps are needed.
@param s1, s2: The strings to be analysed
@type s1: C{string}
@type s2: C{string}
@rtype C{int}
Just after the text block you cite, there is a reference to a previous paper by the same authors: "Secondary Structure of Proteins and Three-dimensional Pattern Recognition". I think it is worth looking into if there is no explanation of the distance (I'm not at work, so I don't have access to the full document).
Otherwise, you can also try to contact the authors directly: Alain Figureau seems to be an old-school French researcher with no contact details whatsoever (no webpage, no e-mail, no "social networking", ...), so I advise trying to contact M.A. Soto, whose e-mail is given at the end of the paper. I think they will give you the answer you're looking for: the experiment's procedure has to be crystal clear in order to be repeatable; it's part of the scientific method in research.