How to use distancematrix function from Biopython? - python

I would like to calculate the distance matrix (using genetic distance function) on a data set using http://biopython.org/DIST/docs/api/Bio.Cluster.Record-class.html#distancematrix, but I seem to keep getting errors, typically telling me the rank is not of 2. I'm not actually sure what it wants as an input since the documentation never says and there are no examples online.
Say I read in some aligned gene sequences:
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
which would be done by
data = Align.read("dataset.fasta","fasta")
But the distance matrix in Cluster.Record class does not accept this. How can I get it to work! ie
dist_mtx = distancematrix(data)

The short answer: You don't.
From the documentation:
A Record stores the gene expression data and related information
The Cluster object is used for gene expression data and not for MSA.
I would recommend using an external tool like MSARC which runs in Python as well.

Related

(scipy.stats.qmc) How to do multiple randomized Quasi Monte Carlo

I want to generate many randomized realizations of a low discrepancy sequence thanks to scipy.stat.qmc. I only know this way, which directly provide a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld_deterministic.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
It seems like using Sobol.random() is discouraged from the doc.
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
then to generate like a 1000 scrambling (or other randomization method) from this initial series.
It avoids having to regenerate the Sobol sequence for each sample and just do scrambling.
How to that?
It seems to me like it is the proper way to do many Randomized QMC, but I might be wrong and there might be other ways.
As the warning suggests, Sobol' is a sequence meaning that there is a link between with the previous samples. You have to respect the properties of 2^m. It's perfectly fine to use Sobol.random() if you understand how to use it, this is why we created Sobol.random_base2() which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5 or do arbitrary things like that. If you do that, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is to reset the sequence between the draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non scrambled sequence for instance) is to sample 2^k and skip the first 2^(k-1) points then you can sample 2^n with n<k-1.

Plotting OpenStreetMap relations does not generate continous lines

All,
I have been working on an index of all MTB trails worldwide. I'm a Python person so for all steps involved I try to use Python modules.
I was able to grab relations from the OSM overpass API like this:
from OSMPythonTools.overpass import Overpass
overpass = Overpass()
def fetch_relation_coords(relation):
rel = overpass.query('rel(%s); (._;>;); out;' % relation)
return rel
rel = fetch_relation_coords("6750628")
I'm choosing this particular relation (6750628) because it is one of several that is resulting in discontinuous (or otherwise erroneous) plots.
I process the "rel" object to get a pandas.DataFrame like this:
elements = pd.DataFrame(rel.toJSON()['elements'])
"elements" looks like this:
The Elements pandas.DataFrame contains rows of the types "relation" (1 in this case), several of the type "way" and many of the type "node". It was my understanding that I would use the "relation" row, "members" column to extract the order of the ways (which point to the nodes), and use that order to make a list of the latitudes and longitudes of the nodes (for later use in leaflet), in the correct order, that is, the order that leads to continuous path on a map.
However, that is not the case. For this particular relation, I end up with the following plot:
If we compare that with the way the relation is displayed on openstreetmap.org itself, we see that it goes wrong (focus on the middle, eastern part of the trail). I have many examples of this happening, although there are also a lot of relations that do display correctly.
So I was wondering, what am I missing? Are there nodes with tags that need to be ignored? I already tried several things, including leaving out nodes with any tags, this does not help. Somewhere my processing is wrong but I don't understand where.
You need to sort the ways inside the relation yourself. Only a few relation types require sorted members, for example some route relations such as route=bus and route=tram. Others may have sorted members, such as route=hiking, route=bicycle etc., but they don't require them. Various other relations, such as boundary relations (type=boundary), usually don't have sorted members.
I'm pretty sure there are already various tools for sorting relation members, obviously this includes the openstreetmap.org website where this relation is shown correctly. Unfortunately I'm not able to point you to these tools but I guess a little bit research will reveal others.
If I opt to just plot the different way on top of each other, I indeed get a continuous plot (index contains the indexes for all nodes per way):
In the Database I would have preferred to have the nodes sorted anyway because I could use them to make a GPX file on the fly. But I guess I did answer my own question with this approach, thank you #scai for tipping me into this direction.
You could have a look at shapely.ops.linemerge, which seems to be smart enough to chain multiple linestrings even if the directions are inconsistent. For example (adapted from here):
from shapely import geometry, ops
line_a = geometry.LineString([[0,0], [1,1]])
line_b = geometry.LineString([[1,0], [2,5], [1,1]]) # <- switch direction
line_c = geometry.LineString([[1,0], [2,0]])
multi_line = geometry.MultiLineString([line_a, line_b, line_c])
merged_line = ops.linemerge(multi_line)
print(merged_line)
# output:
LINESTRING (0 0, 1 1, 2 5, 1 0, 2 0)
Then you just need to make sure that the endpoints match exactly.

How do I use python loops to iterate through the same code with different arguments?

Moving from SAS to Python, I am trying to replicate a SAS macro-type process using input parameters to generate different iterations of code for each loop. In particular, I am trying to binarize continuous variables for modeling (regardless of the merit that may have). What I'm doing at the moment looks as follows:
Some sample data:
import pandas as pd
data=[[2,20],[4,50],[6,75],[1,80],[3,40]]
df=pd.DataFrame(data,columns=['var1','var2'])
Then I run the following:
df['var1_f'] = pd.cut(df['var1'], [0,1,2,3,4,5,7,np.inf], include_lowest=True, labels=['a','b','c','d','e','f','g'])
df['var2_f'] = pd.cut(df['var2'], [-np.inf,0,62,73,81,98,np.inf], include_lowest=True, labels=['a','b','c','d','e','f'])
.
.
.
df1=pd.get_dummies(df,columns=['var1_f'])
df1=pd.get_dummies(df1,columns=['var2_f'])
.
.
.
The above results in a table that contains the original DataFrame, but now has columns appended that take values 1 or 0, depending on whether the continuous variable falls into a particular band. That's great. There must be a better way to do this, rather than having potentially a dozen or so entries that are structurally identical, just with different arguments for variable names and cutoff/label values?
The SAS equivalent would involve replacing "varx_f", "varx", the cutoff values and the labels with placeholders that change on each iteration. In this case, I would do that through pre-defined values (as per the values in the above code), rather than dynamically.
How would I go about looping through this with different arguments for each iteration?
Apologies if this is an existing topic (I'm sure it is) - I just haven't been able to find it.
Thanks for reading!

Most suitable clustering method for a dataset containing 10 dimension numerical arrays

I have a data set (~4k samples) of the following structure:
sample type: string - very general
sample sub type: string
sample model number: number - may be None
signature: number array[10]
sampleID: string - unique id
I want to cluster the samples based on the "signature" (I have a function that measures "distance" between one signature to another). So that when I'll encounter a new signature I'll be able to tell to which type/sub type the sample belongs to.
Which algorithm should I use?
P.S. (I am using python and scikit-learn), I also need to somehow visualize the results.
Since you already have a distance function, and yoour data set is tiny, just use HAC, the grandfather of all clustering algorithms.

Implementing Disjoint Set Data Structure in Python

I'm working on a small project involving cluster, and I think the code given here https://www.ics.uci.edu/~eppstein/PADS/UnionFind.py might be a good starting point for my work. However, I have come across a few difficulties implementing it to my work:
If I make a set containing all my clusters cluster=set([0,1,2,3,4,...,99]) (there are 100 points with the numbers labelling them), then I would like to to group the numbers into cluster, do I simply write cluster=UnionFind()? Now what is the data type of cluster?
How can I perform the usual operations for set on cluster? For instance, I would like to read all the points (which may have been grouped together) in cluster, but type print cluster results in <main.UnionFind instance at 0x00000000082F6408>. I would also like to keep adding new elements to cluster, how do I do it? Do I need to write the specific methods for UnionFind()?
How do I know all the members of a group with one of its member is called? For instance, 0,1,3,4 are grouped together, then if I call 3, I want it to print 0,1,3,4, how do I do this?
thanks
Here's a small sample code on how to use the provided UnionFind class.
Initialization
The only way to create a set using the provided class is to FIND it, because it creates a set for a point only when it doesn't find it. You might want to create an initialization method instead.
union_find = UnionFind()
clusters = set([0,1,2,3,4])
for i in clusters:
union_find[i]
Union
# Merge clusters 0 and 1
union_find.union(0, 1)
# Add point 2 to the same set
union_find.union(0, 2)
Find
# Get the set for clusters 0 and 1
print union_find[0]
print union_find[1]
Getting all Clusters
# print all clusters and their sets
for cluster in union_find:
print cluster, union_find[cluster]
Note:
There is no direct way that gets you all the points given a cluster number. You can loop over all the points and pick the ones that have the required cluster number. You might want to modify the given class to support that operation more efficiently.

Categories

Resources