R/apcluster and scikit-learn - Python

I have been involved in an analysis using a piece of software called DEPICT, which includes affinity propagation analysis in Python.
I am keen to implement a counterpart using R/apcluster for additional analysis. It seems both use correlation, but the results are slightly different. Is it possible to get to the bottom of this? Thanks very much.
af_obj = AffinityPropagation(affinity='precomputed', max_iter=10000, convergence_iter=1000)  # almost all default parameters
print "Affinity Propagation parameters:"
for param, val in af_obj.get_params().items():
    print "\t{}: {}".format(param, val)
print "Performing Affinity Propagation.."
af = af_obj.fit(matrix_corr)
as in Python: https://github.com/jinghuazhao/PW-pipeline/blob/master/files/network_plot.py
require(apcluster)
apres <- apcluster(corSimMat,tRaw,details=TRUE)
as in R:
https://github.com/jinghuazhao/PW-pipeline/blob/master/files/network.R
J
Jing hua

It would be great to have all functionality of the R package apcluster available in Python!
To answer your question regarding the different results:
First of all, check whether the correlation/similarity matrices are the same.
Also note that the results are not 100% deterministic, since a small amount of random noise is added internally.
You would also have to check whether all parameters of the two implementations are the same. If you use default parameters, you will obviously only get the same results if the defaults are exactly the same, and as far as I know, they are not. The default damping parameter, for instance, differs.
I hope that helps.
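As a concrete first step, here is a minimal sketch of pinning the parameters that most often explain such discrepancies. The synthetic matrix, the damping=0.9 value (to mirror apcluster's lam default) and the explicit median preference are illustrative assumptions, not a verified one-to-one mapping between the two packages:
import numpy as np
from sklearn.cluster import AffinityPropagation

# Stand-in for the correlation matrix from the question; in practice this would be
# the same matrix_corr / corSimMat input on both the Python and the R side.
rng = np.random.RandomState(0)
data = rng.rand(20, 5)
matrix_corr = np.corrcoef(data)

# Pin parameters explicitly rather than relying on defaults:
#   - damping plays the role of apcluster's lam (sklearn defaults to 0.5,
#     apcluster to 0.9, so the defaults already disagree)
#   - preference plays the role of apcluster's input preference p
pref = np.median(matrix_corr)
af_obj = AffinityPropagation(affinity='precomputed',
                             damping=0.9,
                             preference=pref,
                             max_iter=10000,
                             convergence_iter=1000)
af = af_obj.fit(matrix_corr)
print(af.cluster_centers_indices_)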

Related

Why doesn't xgboost.get_booster().get_score() return any value for one of the variables?

A few questions about feature importance in xgboost in Python:
I'm trying to print the feature importances using xgboost.get_booster().get_score(). However, the function sometimes doesn't return anything for some variables. Does that mean the score for those variables is 0?
Why do the results of xgboost.get_booster().get_score() look so different from xgboost.feature_importances_?
About the importance_type argument in get_score(): what would be a reasonable option for a multi-class classification task? The default in Python seems to be "weight", but according to this article importance_type='gain' makes more sense (I agree). However, when I use importance_type='gain' the results don't make as much sense as those with the default importance_type.
Thank you!
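For what it's worth, here is a minimal sketch of inspecting the two views of importance side by side; the synthetic data and model settings are purely illustrative. A feature that never appears in any split is simply absent from the get_score() dictionary, which is the usual explanation for variables that seem to get no value, but verify that on your own model:
import numpy as np
import xgboost as xgb

# Synthetic placeholder data purely for illustration.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 3, size=200)

model = xgb.XGBClassifier(n_estimators=20)
model.fit(X, y)

booster = model.get_booster()
# Features that never appear in any split are simply missing from these dicts.
print(booster.get_score(importance_type='weight'))
print(booster.get_score(importance_type='gain'))

# The sklearn-style attribute is normalized over all features, so its numbers
# are not expected to match the raw get_score() values.
print(model.feature_importances_)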

Using scipy minimize with constraint on one parameter

I am using scipy.optimize.minimize, and I'd like one parameter to be searched only over values with two decimals.
from scipy.optimize import minimize
from sklearn.metrics import mean_squared_error

def cost(parameters, input, target):
    output = self.model(parameters=parameters, input=input)
    cost = mean_squared_error(target.flatten(), output.flatten())
    return cost

parameters = [1, 1]  # initial parameters
res = minimize(fun=cost, x0=parameters, args=(input, target))
model_parameters = res.x
Here self.model is a function that performs some matrix manipulation based on the parameters; input and target are two matrices. The function works the way I want it to, except that I would like parameters[1] to have a constraint. Ideally I'd just like to give it a numpy array, like np.arange(0, 10, 0.01). Is this possible?
In general this is very hard to do, as smoothness is one of the core assumptions of those optimizers.
Problems where some variables are discrete and some are not are hard, and are usually tackled either by mixed-integer optimization (which works well for MI linear programming and reasonably well for MI convex programming, although there are fewer good solvers) or by global optimization (usually derivative-free).
Depending on the details of your task, I recommend decomposing the problem (a sketch follows below):
an outer loop that fixes the variable to each value in an np.arange(0, 10, 0.01)-like grid
an inner loop that optimizes the remaining variables while this variable is fixed
return the model with the best objective (with status == success)
This results in N inner optimizations, where N is the size of the state space of the variable you fix.
Depending on your task/data, it might be a good idea to traverse the fixing grid monotonically (as np's arange already does) and use the solution of iteration i as the initial point for problem i+1 (potentially fewer iterations are needed if the guess is good). But this is probably not relevant here; see the next part.
If you really have only 2 parameters, as indicated, this decomposition leads to an inner problem with only 1 variable. In that case, don't use minimize; use minimize_scalar (faster and more robust; it does not need an initial point).
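A minimal sketch of that decomposition, using minimize_scalar for the inner 1-D problem; the toy objective, grid and bounds below are illustrative assumptions standing in for the asker's self.model:
import numpy as np
from scipy.optimize import minimize_scalar

def objective(p0, p1):
    # Toy stand-in for the real cost(parameters, input, target).
    return (p0 - 3.21) ** 2 + (p1 - 1.23) ** 2

best = None
for p1_fixed in np.arange(0, 10, 0.01):   # outer loop: fix the "two decimals" parameter
    res = minimize_scalar(lambda p0: objective(p0, p1_fixed),
                          bounds=(-10, 10), method='bounded')  # inner 1-D optimization
    if res.success and (best is None or res.fun < best[0]):
        best = (res.fun, res.x, p1_fixed)

best_cost, best_p0, best_p1 = best
print(best_cost, best_p0, best_p1)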

Error in scipy wilcoxon signed rank test for equal series?

I have a problem with the results of scipy's Wilcoxon signed-rank test:
x1 = [29.39958, 29.21756, 29.350915, 29.34911, 29.212635]
sp.wilcoxon(x1, x1, zero_method="wilcox", correction=True)
returns statistic=0.0, pvalue=nan.
But with zero_method="pratt", it returns statistic=0.0, pvalue=0.043114446783075355 instead.
I think there is a mistake there.
The statistic (the z-score, if I am not mistaken) is the same, but the p-values are different.
Am I right and scipy wrong, or am I missing something?!
I wanted to check with another module, but neither alglib (http://www.alglib.net/hypothesistesting/wilcoxonsignedrank.php) nor statsmodels (http://statsmodels.sourceforge.net/devel/generated/statsmodels.sandbox.descstats.sign_test.html?highlight=wilcoxon) allows for the Pratt correction, which I need because it is deemed more conservative...
Could you also advise on alternative modules for Python stats? Thanks.
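For completeness, a self-contained reproduction sketch with explicit imports; the numbers are the ones above, and exact behaviour (warnings, nan vs. a finite p-value) may differ across scipy versions:
from scipy.stats import wilcoxon

x1 = [29.39958, 29.21756, 29.350915, 29.34911, 29.212635]

# All pairwise differences are exactly zero, so "wilcox" discards every pair
# and has nothing left to test, hence the nan p-value.
print(wilcoxon(x1, x1, zero_method="wilcox", correction=True))

# "pratt" keeps the zero differences when ranking, which is why it still
# produced a number in the reported version; whether that number is
# meaningful for an all-zero sample is exactly the point in question.
print(wilcoxon(x1, x1, zero_method="pratt", correction=True))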

ScipyOptimizer gives incorrect optimization result

I am running a non-linear optimization problem in OpenMDAO for which I know the optimal solution (I just want to verify it). I am using the SLSQP driver configuration of ScipyOptimizer from openmdao.api.
I have 3 design variables A, B and C, their respective design spaces (Amin to Amax for A, and so on) and a single objective function Z. As I said, I know the optimal values of all three design variables (let's call them Asol, Bsol and Csol) which yield the minimum value of Z (call it Zsol).
When I run this problem, I get a value of Z which is larger than Zsol, which means it is not the optimal solution. When I assign Csol to C and run the problem with only A and B as design variables, I get a value of Z which is much closer to Zsol and which is actually lower than what I got earlier (in the 3-design-variable scenario).
Why am I observing this behavior? Shouldn't ScipyOptimizer give the same solution in both cases?
EDIT: Adding some code..
from openmdao.api import IndepVarComp, Group, Problem
from openmdao.api import ScipyOptimizer

class RootGroup(Group):
    def __init__(self):
        super(RootGroup, self).__init__()
        self.add('desvar_f', IndepVarComp('f', 0.08))
        self.add('desvar_twc', IndepVarComp('tool_wear_compensation', 0.06))
        self.add('desvar_V', IndepVarComp('V', 32.0))
        # Some more config (adding components, connections etc.)

class TurningProblem_singlepart(Problem):
    def __init__(self):
        super(TurningProblem_singlepart, self).__init__()
        self.root = RootGroup()
        self.driver = ScipyOptimizer()
        self.driver.options['optimizer'] = 'SLSQP'
        self.driver.add_desvar('desvar_f.f', lower=0.08, upper=0.28)
        self.driver.add_desvar('desvar_twc.tool_wear_compensation', lower=0.0, upper=0.5)
        self.driver.add_desvar('desvar_V.V', lower=32.0, upper=70.0)
        self.driver.add_objective('Inverse_inst.comp_output')
        # Other config
This code gives me an incorrect result. When I remove desvar_twc from both classes and assign it its optimal value (from the solution I have), I get a fairly correct result, i.e. a value of the objective function that is lower than in the previous scenario.
Without seeing your actual model, we can't say anything for sure. However, it is NOT generally the case that a local optimizer's solution is independent of the starting condition. That is only the case if the problem is convex. So I would guess that your problem is not convex and you're running into local optima.
You can try to get around this by using the COBYLA optimizer instead of SLSQP, which in my experience can manage to jump over some local optima better. But if your problem is really bumpy, then I would suggest you switch to NSGA-II or ALPSO from the pyOptSparse library. These are heuristic-based optimizers that do a good job of finding the "biggest hill", though they don't always climb all the way to the top of it (they don't converge as tightly). The heuristic algorithms are also generally more expensive than the gradient-based methods.
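A minimal sketch of swapping in COBYLA with the class from the question; the setup/run/access calls reflect the older OpenMDAO API suggested by the code above and are assumptions, not verified against a specific version:
# Assumes the TurningProblem_singlepart class defined in the question above.
prob = TurningProblem_singlepart()

# Swap the gradient-based SLSQP for COBYLA, which sometimes steps over
# shallow local optima (it is still a local method, though).
prob.driver.options['optimizer'] = 'COBYLA'

prob.setup()
# Optionally restart from a different initial point to probe for local optima,
# e.g. prob['desvar_V.V'] = 50.0 (dictionary-style access in this OpenMDAO API).
prob.run()
print(prob['Inverse_inst.comp_output'])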

How to optimize operations on large (75,000 items) sets of booleans in Python?

There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence, what it does is create a large hashtable(?) of integer-keyed boolean values; in the worst case, one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays: addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) one, which naturally gets pretty slow on such large sets. The whole data structure could be optimized, because there are long spans of continuous values; for example, all keys from 1 to 74,000 might contain true. Also, the script is written for Python 2.2, which is a pretty old version, and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.
For example:
import numpy

# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)
# Get all items that are larger than 500, takes 2.58ms
y = x > 500
# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with far more elements than you have, I think 75,000 should not be a problem either :)
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
import re  # already imported at the top of svnmerge.py; repeated here so the snippet is self-contained

class RevisionSet(set):
    """
    A set of revisions, subclassed from the built-in set for easy
    manipulation (this requires Python 2.3+). As this class does not
    include branch information, it's assumed that one instance will be
    used per branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a set or list whose items are the revisions. Raises ValueError if the
        input string is invalid."""
        revision_range_split_re = re.compile('[-:]')
        if isinstance(parm, set):
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1]) + 1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start, end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e + 1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions with the original RevisionSet and my RevisionSet above, and the code above is 3x to 7x faster for those operations when operating on two RevisionSets that have 75,000 elements each. I know other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works, and if it does, seeing whether it is fast enough for you. If it isn't, I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
"For example, all keys from 1 to 74,000 might contain true"
Why not work on a subset? Just 74,001 to the end.
Pruning 74/75ths of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to hold a set of revisions. I think the internal representation of a revision should be an integer, and revision ranges should be created only as needed.
There is no compelling reason to use code that supports Python 2.3 and earlier.
Just a thought. I used to do this kind of thing with run-length coding in binary image manipulation: store each set as a series of numbers (number of bits off, number of bits on, number of bits off, and so on).
Then you can implement all sorts of boolean operations on them as decorations on a simple merge algorithm; a sketch of the idea is given below.
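A minimal sketch of that idea, expressed here as sorted (start, end) intervals rather than raw run lengths; the union function and the example revisions are illustrative assumptions, not the svnmerge.py implementation:
def union(a, b):
    """Union of two interval lists [(start, end), ...], each sorted and disjoint."""
    merged = sorted(a + b)          # merge step of the classic merge algorithm
    result = []
    for start, end in merged:
        if result and start <= result[-1][1] + 1:
            # Overlapping or adjacent runs: extend the previous interval.
            result[-1] = (result[-1][0], max(result[-1][1], end))
        else:
            result.append((start, end))
    return result

# 74,000 consecutive revisions collapse to a single pair, so set operations
# touch a handful of intervals instead of 75,000 individual keys.
a = [(1, 74000)]
b = [(73990, 74100), (74500, 74600)]
print(union(a, b))   # [(1, 74100), (74500, 74600)]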
